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[57] ABSTRACT 

A method and system are provided for processing queries, 
where the queries do not reference a particular materialized 
view. Specifically, techniques are provided for handling a 
query that specifies a first set of one or more aggregate 
functions, where the particular materialized view reflects a 
second set of one or more aggregate functions. Whether the 
query can be rewritten is determined based on the aggregate 
functions in the first and second sets, and the corresponding 
arguments. Techniques are also provided for processing a 
query that (1) does not reference a particular materialized 
view, (2) specifies a first set of one or more aggregate 
functions, where the particular materialized view reflects a 
second set of one or more aggregate functions. A technique 
is also provided for rewriting queries that specify an outer 
join that has a dimension table on the child-side of the outer 
join and a fact table on the parent-side of the outer join. The 
query is rewritten to produce a rewritten query by replacing 
references to the fact table in the query with references to a 
materialized view. The rewritten query specifies an outer 
join that has the dimension table on the child side and the 
materialized view on the parent side. 

26 Claims, 11 Drawing Sheets 
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REWRITING A QUERY IN TERMS OF A 
SUMMARY BASED ON AGGREGATE 
COMPUTABILITY AND CANONICAL 
FORMAT, AND WHEN A DIMENSION TABLE 
IS ON THE CHILD SIDE OF AN OUTER 
JOIN 

FIELD OF THE INVENTION 

The present invention relates to database systems, and in 
particular, to rewriting queries to access data sources that are 
not specifically referenced in the queries. 

BACKGROUND OF THE INVENTION 

In a database management system (DBMS), data is stored 
in one or more data containers, each container contains 
records, and the data within each record is organized into 
one or more fields. In relational database systems, the data 
containers are referred to as tables, the records are referred 
to as rows, and the fields are referred to as columns. In object 
oriented databases, the data containers are referred to as 
object classes, the records are referred to as objects, and the 
fields are referred to as attributes. Other database architec- 
tures may use other terminology. 

The present invention is not limited to any particular type 
of data container or database architecture. However, for the 
purpose of explanation, the examples and the terminology 
used herein shall be that typically associated with relational 
databases. Thus, the terms "table", "row" and "column" 
shall be used herein to refer respectively to the data 
container, record, and field. 

For various reasons, it may not be desirable for certain 
users to have access to all of the columns of a table. For 
example, one column of an employee table may hold the 
salaries for the employees. Under these circumstances, it 
may be desirable to limit access to the salary column to 
management, and allow all employees to have access to the 
other columns. To address this situation, the employees may 
be restricted from directly accessing the table. Instead, they 
may be allowed to indirectly access the appropriate columns 
in the table through a "view". 

A view is a logical table. As logical tables, views may be 
queried by users as if they were a table. However, views 
actually present data that is extracted or derived from 
existing tables. Thus, the problem described above may be 
solved by (1) creating a view that extracts data from all 
columns of the employee table except the salary column, and 
(2) allowing all employees to access the view. 

A view is defined by metadata referred to as a view 
definition. The view definition contains mappings to one or 
more columns in the one or more tables containing the data. 
Typically, the view definition is in the form of a database 
query. Columns and tables that are mapped to a view are 
referred to herein as base columns and base tables of the 
view, respectively. The data maintained in the base columns 
is referred to herein as base data. 

The data presented by conventional views is gathered and 
derived on-the-fly from the base tables in response to queries 
that access the views. That data gathered for the view is not 
persistently stored after the query accessing the view has 
been processed. Because the data provided by conventional 
views is gathered from the base tables at the time the views 
are accessed, the data from the views will reflect the current 
state of the base tables. However, the overhead associated 
with gathering the data from the base tables for a view every 
time the view is accessed may be prohibitive. 
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A materialized view, on the other hand, is a view for 
which a copy of the view data is stored separate from the 
base tables from which the data was originally gathered and 
derived. The data contained in a materialized view is 

5 referred to herein as ("materialized data"). Materialized 
views eliminate the overhead associated with gathering and 
deriving the view data every time a query accesses the view. 
As new data is periodically added to the base tables, the 
materialized view needs to be updated (i.e., refreshed) to 

io reflect the new base data. 

Materialized views that are derived from more than one 
base table are created by performing a join between the base 
tables. A join is a query that combines rows from two or 
more tables, views, or materialized views. A join is per- 

15 formed whenever multiple tables appear in a query's FROM 
clause. The query's select list can select any columns from 
any of the base tables listed in the FROM clause. 

Most join queries contain WHERE clause conditions that 
compare two columns, each from a different table. Such a 

20 condition is called a join condition. To execute a join, the 
DBMS combines pairs of rows for which the join condition 
evaluates to TRUE, where each pair contains one row from 
each table. 

To execute a join of three or more tables, the DBMS first 
25 joins two of the tables based on the join conditions com- 
paring their columns and then joins the result to another 
table based on join conditions containing columns of the 
joined tables and the new table. The DBMS continues this 
process until all tables are joined into the result. 
30 In addition to join conditions, the WHERE clause of a join 
query can also contain other conditions that refer to columns 
of only one table. These conditions can further restrict the 
rows returned by the join query. 

An equijoin is a join with a join condition containing an 
35 equality operator. An equijoin combines rows that have 
equivalent values for the specified columns. Queryl is an 
equijoin that combines the rows of tables R and S where the 
value in column r. a is the same as the value in column s.a; 

40 

QUERY1 

SELECT* 
FROM R, S 
WHERE r.a - s.a; 



In this example, table R is the "left" or "child" side table 
of the join, and table S is the "right*' or "parent** table of the 
join. The join illustrated by Queryl is a "simple 1 * or "inner" 

50 join. With an inner join, rows from the child table that do not 
satisfy the join condition are not reflected in the join result. 
In contrast, an outer join returns all child rows that satisfy 
the join condition and those rows from the child table for 
which no rows from the parent satisfy the join condition. 

55 Computer database systems that are used for data ware- 
housing frequently maintain materialized views that contain 
pre-computed summary information in order to speed up 
query processing. Such summary information is created by 
applying an aggregate function, such as SUM, COUNT, or 

60 AVERAGE, to values contained in the base tables. Materi- 
alized views that contain pre-computed summary informa- 
tion are referred to herein as "summary tables" or more 
simply, "summaries". 

Summary tables typically store aggregated information, 

65 such as "sum of PRODUCT_SALES, by region, by 
month." Other examples of aggregated information include 
counts of tally totals, minimum values, maximum values, 
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and average calculations. Summary tables are used to reduce 
the overhead associated with processing queries that request 
summary information. Because the summary information is 
already pre-computed, it need not be re -computed during the 
execution of each query that requests summary information. 5 
Instead, the pre-computed summary values can be directly 
retrieved from the summary columns of the summary table. 

When a database user knows that a particular materialized 
view contains the data desired by the user, the user can 
formulate a query that extracts the desired data directly from 1Q 
that materialized view. However, there are numerous cir- 
cumstances in which a user will design a query that does not 
reference a materialized view to retrieve a set of data that 
can be most efficiently retrieved by accessing the material- 
ized view. For example, the user may not be aware of all of ^ 
the materialized views available in the database, or may not 
be sure how a particular materialized view could be used to 
retrieve the desired information. 

Even when the user is aware that materialized views are 
available that have data desired by the user, the user may 2Q 
prefer to have the server transparently rewrite the query in 
terms of the materialized views, rather than writing the 
query to directly reference the materialized views. By allow- 
ing the server to rewrite the query, the database administra- 
tor may easily add and drop materialized views without 25 
affecting the application code that imbeds the user query. For 
example, if a user query is rewritten by the server to use a 
particular materialized view MV1, then the application will 
not break if the database administrator decides to drop MV1. 
In contrast, if the user query directly references MV1 in the 3Q 
application code, then dropping MV1 will break the appli- 
cation. 

In light of the foregoing, some database servers include 
mechanisms for rewriting queries that do not reference 
materialized views in a way that causes them to reference 35 
materialized views. The execution of the rewritten query is 
often considerably improved relative to the original query 
because the relational objects accessed by the rewritten 
query are (e.g. the materialized views) much smaller than 
the objects referenced in the original query (e.g. the base ^ 
tables), and/or the number of objects used by the rewritten 
query is less than the number of objects referenced in the 
original query. 

For example, a summary can correspond to a query which 
joins two tables and performs data aggregation to compute 45 
sum-of-sales by city, and stores the materialized result in a 
table. If a user issues a query Q which requests sum-of-sales 
by city, then Q can be transparently rewritten to access 
pre-computed data stored in the summary table. Because of 
the rewrite, the result for Q can be quickly produced by 50 
simply scanning the summary table instead of joining two 
tables and then aggregating the data. 

Currently, database systems that include query rewrite 
mechanisms perform a series of tests on an incoming query 
to determine whether the query can be rewritten to access a 55 
particular materialized view. The tests include tests for (1) 
join compatibility and (2) data sufficiency. For summary 
tables, the tests additionally include tests for (3) grouping 
compatibility and (4) aggregate computability. 

With respect to join compatibility, the test is passed if all 60 
joins in a materialized view match exactly with joins in a 
query, with possibly zero or more additional joins in the 
query. For example, if a materialized view joins tables A and 
B, and the query requires a join of A, B and C, then the join 
compatibility test is passed. However, if the materialized 65 
view joins tables A, B and D, and the query requires a join 
of A, B and C, then the join compatibility test fails. 
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With respect to data sufficiency, the test is passed if all 
necessary columns of a query can be obtained using the data 
stored in the materialized view. If values from a column 
necessary for the processing of a query are not contained in 
the materialized view, then the data sufficiency test fails. 

With respect to grouping compatibility, the test is passed 
if the aggregates in a summary are at the same or lower level 
than the level of aggregates in a query. For example, a sum 
of sales by month is at a lower level than a sum of sales by 
year. Therefore, the grouping compatibility is passed if the 
summary has sum of sales by month and the query asks for 
sum of sales by year, but is failed if the summary has sum 
of sales by year and the query asks for sum of sales by 
month. 

With respect to aggregate computability, the test is passed 
if each aggregate required by the query is computable from 
one or more aggregates in the summary. For example, the 
aggregate computability test is passed if the summary con- 
tains sum and count values, and the query requires an 
average. The aggregate computability test fails if the sum- 
mary only contains sum, and the query requires an average. 

If all of these tests are passed, then the rewrite mechanism 
determines that the received query can be rewritten to 
reference the materialized view in question. 

Unfortunately, as these tests are currently applied, data- 
base servers often conclude that materialized views cannot 
be used to process a query when in fact they could be. In 
fact, database servers may conclude that a query cannot be 
rewritten to access a materialized view when use of the 
materialized view would actually be the most efficient way 
to process the query. 

Based on the foregoing, it is clearly desirable to provide 
a query rewrite mechanism that is capable of rewriting 
queries to access materialized views that would otherwise 
have been considered ineligible by conventional rewrite 
mechanisms. 

SUMMARY OF THE INVENTION 

A method and system are provided for processing queries, 
where the queries do not reference a particular materialized 
view. According to one aspect of the invention, techniques 
are provided for handling a query that specifies a first set of 
one or more aggregate functions, where the particular mate- 
rialized view reflects a second set of one or more aggregate 
functions. Under these conditions, it is determined whether 
the particular materialized view satisfies each condition in a 
set of conditions, the set of conditions at least including: 
that each aggregate function in the first set of aggregate 
functions be computable from one or more correspond- 
ing aggregate functions in the second set of aggregate 
functions, and 

that the argument to each aggregate function in the first set 
of aggregate functions be equivalent to the argument of 
the one or more corresponding aggregate functions in 
the second set of aggregate functions, 
where first set of aggregate functions includes an particu- 
lar aggregation function to be applied to a target 
population, and 
where the particular aggregation function is an aggrega- 
tion function from a set of aggregation functions that 
consists of variance and standard deviation. 
Under these conditions, a test is performed to determine 
whether each aggregate function in the first set of aggregate 
functions is computable from one or more corresponding 
aggregate functions in the second set of aggregate functions. 
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The testing includes determining whether the particular panying drawings and in which like reference numerals refer 

materialized view includes a variance, sum, and count of a to similar elements and in which: 

source population on which the target population is func- FIG. 1 is a block diagram of a computer system upon 
tionally dependent. If the materialized view satisfies each which an embodiment of the invention may be implemented; 
condition in the set of conditions, then the query is rewritten 5 FJGS 2A> 2B ^ 2C m parts of a flow chart that 
to produce a rewritten query that references the materialized ilhlstrates steps for selecting a query execution plan accord- 
view and derives the results of the particular aggregation • {Q afl embodiment of the invention; 

function for the target population from the variance, sum, _ ¥ _ „ . _ . ...... . c < 4 

j , f *L I,,- FIG. 2D is a flow chart that illustrates steps for determin- 

and count of the source population, , . . , . , . *. ... , . 

A . r*u • *• * u ■ * n ine whether a particular materialized view is eligible to be 

According to another aspect of the invention, a techniques 10 & , . p . , & 

. .i. . /i\ j . used to rewnte a particular query; 

are provided for processing a query that (1) does not r M 79 

reference a particular materialized view, (2) specifies a first FIG. 3 is a flow chart that illustrates steps for determining 

set of one or more aggregate functions, where the particular whether a materialized view is join compatible with a query 

materialized view reflects a second set of one or more according to an embodiment of the invention; 

aggregate functions. The technique includes determining 15 FIG. 4 is a flow chart illustrating steps for determining 

whether the particular materialized view satisfies each con- whether a join is lossless; 

dition in a set of conditions, the set of conditions at least pIG. 5A is a diagram that illustrates a join graph; 

including: p IG 5B ^ a diagram that illustrates a join graph; 

that each aggregate function in the first set of aggregate F , G 6A is a diagram that mustrates a joitl graph 

functions be computable from one or more correspond- 20 ^ ^ ^ materiaUzed view; 

ing aggregate functions in the second set of aggregate . • - L 

functions' and 1S a dia S ram mat illustrates a join graph associ- 

. , „ ated with a query; 

that the argument to each aggregate function in the first set _ T _ . , . £ . . 

c * a u • i .u ♦ r FIG. 6C is the join graph of FIG. 6 A, where the common 

of aggregate functions be equivalent to the argument of . , l ■ • u ^1^7* • • u 

the one or more corresponding aggregate functions in 25 s ™ b ^Z ^ T V ^ ^ 

the second set of aggregate functions. of nG 6B 15 ^ghnghted. 

A test is performed to determine whether the argument of FIG. 7 is a flow chart illustrating steps for performing an 

each aggregate function in the first set of aggregate functions aggregate compatibility test according to an embodiment of 

is equivalent to the argument of one or more corresponding me invention; 

aggregate functions in the second set of aggregate functions 30 FIG. 8A is a block diagram of an exemplary fact table; 

by FIG. 8B is a block diagram of an exemplary 

a) creating a transformed version of the argument by de-normalized dimension table; 

transforming the argument of the aggregate function to FIG. 8C is a block diagram of exemplary dimension tables 

a canonical form; 35 in a normalized schema; and 

b) creating transformed versions of the arguments of the pj G s. 9A, 9B and 9C are block diagrams of tables used 
one or more corresponding aggregate functions by t0 illustrate scaling factors according to an embodiment of 
transforming the arguments of the one or more corre- me invention. 

sponding aggregate functions to a canonical form; and 

c) comparing transformed version of the argument to the 40 DETAILED DESCRIPTION OF THE 
transformed versions of the arguments of the one or PREFERRED EMBODIMENT 

more corresponding aggregate functions. . , , r . . A A . 

If the materialized view satisfies each condition in the set / method f a 1 d .apparatus for rewnUng queries to take 

of conditions, then the query is rewritten to produce a ^vantage of existing materialized views is described. In the 

rewritten query that references the materialized view. 45 fo,,owmg description for the purposes of explanaUon, 

At , . . t j j r *• *u * numerous specific details are set forth in order to provide a 

A technique is also provided for rewriting queries that *\ . . . £ ■„ u 

. t M , . - H . . • * ui thorough understanding of the present invention. It will be 

specify an outer join that has a dimension table on the & „ *\ . 

u*i j -j t *u * - • j c 4 * ui *u„ , apparent, however, to one skilled in the art that the present 

child-side of the outer join and a fact table on the parent-side . vv . ' . y . , 4 . 1£ , . 

e f , t ■ ■ rjJ„ • „„ ,„ . „ invention may be practiced without these specific details. In 

of the outer loin. The query is rewritten to produce a , „ , . , • 

J . 1 * r • »u r . . ui- • other instances, well-known structures and devices are 

rewritten query by replacing references to the fact table in 50 , r . , - , 

the query with references to a materialized view. The rewrit- s ^ OWG . ln bI «* dla S ram tormm older to avold v™"**^ 

ten query specifies an outer join that has the dimension table obscuring the present invention. 

on the child side and the materialized view on the parent HARDWARE OVERVIEW 
side. 

By using the techniques described herein, more queries 55 FIG. 1 is a block diagram that illustrates a computer 

can take advantage of the set of materialized views that exist system 100 upon which an embodiment of the invention 

in the database, thus resulting in improved query execution may be implemented. Computer system 100 includes a bus 

speeds. Conversely, the number of materialized views that a 102 or other communication mechanism for communicating 

database must contain in order to efficiently process a information, and a processor 104 coupled with bus 102 for 

particular set of queries is reduced, thus reducing the size of 60 processing information. Computer system 100 also includes 

the database and eliminating the overhead associated with a main memory 106, such as a random access memory 

maintaining the materialized views that are no longer nec- (RAM) or other dynamic storage device, coupled to bus 102 

essary. for storing information and instructions to be executed by 

processor 104. Main memory 106 also may be used for 

BRIEF DESCRIPTION OF THE DRAWINGS 6J ^ temporary variables Q \ other mtermediate informa . 

TTie present invention is illustrated by way of example, tion during execution of instructions to be executed by 

and not by way of limitation, in the figures of the accom- processor 104. Computer system 100 further includes a read 
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only memory (ROM) 108 or other static storage device data on bus 102. Bus 102 carries the data to main memory 

coupled to bus 102 for storing static information and instruc- . 106, from which processor 104 retrieves and executes the 

tions for processor 104. A storage device 110, such as a instructions. The instructions received by main memory 106 

magnetic disk or optical disk, is provided and coupled to bus may optionally be stored on storage device 110 either before 

102 for storing information and instructions. 5 or after execution by processor 104. 

Computer system 100 may be coupled via bus 102 to a Computer system 100 also includes a communication 

display 112, such as a cathode ray tube (CRT), for displaying interface 118 coupled to bus 102. Communication interface 

information to a computer user. An input device 114, includ- 118 provides a two-way data communication coupling to a 

ing alphanumeric and other keys, is coupled to bus 102 for network link 120 that is connected to a local network 122. 

communicating information and command selections to 10 For example, communication interface 118 may be an inte- 

processor 104. Another type of user input device is cursor grated services digital network (ISDN) card or a modem to 

control 116, such as a mouse, a trackball, or cursor direction provide a data communication connection to a correspond- 

keys for communicating direction information and com- ing type of telephone line. As another example, communi- 

mand selections to processor 104 and for controlling cursor cation interface 118 may be a local area network (LAN) card 

movement on display 112. This input device typically has 15 to provide a data communication connection to a compatible 

two degrees of freedom in two axes, a first axis (e.g., x) and LAN. Wireless links may also be implemented. In any such 

a second axis (e.g., y), that allows the device to specify implementation, communication interface 118 sends and 

positions in a plane. receives electrical, electromagnetic or optical signals that 

The invention is related to the use of computer system 100 carry digital data streams representing various types of 

for rewriting queries to take advantage of existing materi- 20 information. 

alized views. According to one embodiment of the Network link 120 typically provides data communication 
invention, queries are rewritten by computer system 100 to through one or more networks to other data devices. For 
take advantage of existing materialized views in response to example, network link 120 may provide a connection 
processor 104 executing one or more sequences of one or through local network 122 to a host computer 124 or to data 
more instructions contained in main memory 106. Such 25 equipment operated by an Internet Service Provider (ISP) 
instructions may be read into main memory 106 from 126. ISP 126 in turn provides data communication services 
another computer-readable medium, such as storage device through the world wide packet data communication network 
110. Execution of the sequences of instructions contained in now commonly referred to as the "Internet" 128. Local 
main memory 106 causes processor 104 to perform the network 122 and Internet 128 both use electrical, electro- 
process steps described herein. In alternative embodiments, 30 magnetic or optical signals that carry digital data streams, 
hard-wired circuitry may be used in place of or in combi- The signals through the various networks and the signals on 
nation with software instructions to implement the inven- network link 120 and through communication interface 118, 
tion. Thus, embodiments of the invention are not limited to which carry the digital data to and from computer system 
any specific combination of hardware circuitry and software. 100, are exemplary forms of carrier waves transporting the 

The term "computer-readable medium" as used herein information, 
refers to any medium that participates in providing instruc- Computer system 100 can send messages and receive 
tions to processor 104 for execution. Such a medium may data, including program code, through the network(s), net- 
take many forms, including but not limited to, non-volatile work link 120 and communication interface 118. In the 
media, volatile media, and transmission media. Non-volatile ^ Internet example, a server 130 might transmit a requested 
media includes, for example, optical or magnetic disks, such code for an application program through Internet 128, ISP 
as storage device 110. Volatile media includes dynamic 126, local network 122 and communication interface 118. 
memory, such as main memory 106. Transmission media The received code may be executed by processor 104 as 
includes coaxial cables, copper wire and fiber optics, includ- it is received, and/or stored in storage device 110, or other 
ing the wires that comprise bus 102. Transmission media can 4J non-volatile storage for later execution. In this manner, 
also take the form of acoustic or light waves, such as those computer system 100 may obtain application code in the 
generated during radio-wave and infra-red data com muni- f orm G f a carrier wave, 
cations. 

Common forms of computer-readable media include, for TERMS AND NOTATION 

example, a floppy disk, a flexible disk, hard disk, magnetic 50 For ^ of ej£planatioQ) lhe following lerms and 

tape, or any other magnetic medium, a CD-ROM, any other COQVeiltions are used herein l0 describe embodiments of the 

optical medium, punchcards, papertape, any other physical invention* 

medium with patterns of holes, a RAM, a PROM, and . . , . 

EPROM, a FLASH-EPROM, any other memory chip or ^Jf*". >< 18 used herem * n ™°<}°* °P^ tat : 

cartridge, a carrier wave as described hereinafter, or any 55 Tlus,Tl><T2 inchcates an inner join between tables Tl and 
other medium from which a computer can read. 

Various forms of computer readable media may be ™ e J?^ 3 ^."^" 15 used here * * ™ outer j™ °^ or A 

involved in carrying one or more sequences of one or more ™us, T1->T2 indites an outer join between tables Tl and 

instructions to processor 104 for execution. For example, the ^ wherc table T1 15 thc outer table - 

instructions may initially be carried on a magnetic disk of a 60 In ^e joins HxT2 and T1->T2, Tl is referred to as the 

remote computer. The remote computer can load the instruc- "left" or "child" table and T2 is referred to as the "right" or 

tions into its dynamic memory and send the instructions over "parent" table. 

a telephone line using a modem. A modem local to computer A complex join involves numerous joins. For example, 

system 100 can receive the data on the telephone line and the complex join TxT2xT3->T4 requires an inner join 

use an infra-red transmitter to convert the data to an infra-red 65 between Tl and T2 to produce a first set of rows, an inner 

signal. An infra-red detector can receive the data carried in join between the first set of rows and T3 to produce a second 

the infra-red signal and appropriate circuitry can place the set of rows, and an outer join between the second set of rows 
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and T4 to produce a third set of rows. The third set of rows 
constitute the result set of the complex join. 

Joins are performed by combining two tables. In this 
context, the term "table" refers to any source of rows, 
including the product of join operations between one or 
more other tables. For example, the complex join illustrated 
above can be expressed as an outer join between a table RSI 
and table T4, where table RSI is equal to the result of 
(HxT2xT3). In this case, RSI is the child table of the 
join, and T4 is the parent table of the join. Similarly, the join 
Tl><T2xT3 can be expressed as an inner join between a 
table RS2 and table T3, where RS2 is equal to the result of 
(HxT2). 

Joins can be illustrated using join graphs. In a join graph, 
vertices represent base tables and edges represent joins 
between the base tables. For example, FIG. 5A is join graph 
illustrating the joins contained in the query: 

Select product_name, city, month, sum(dollarsales) 

from Sales, Product, Region, Time 

where 

Sales.prod_key-Product.prod_key 

Sales.region__key "Region. region__key 

Sales.time_key =Time .time_key 

Group by product_name, city, month 

Comparing a join graph of a materialized view with the 
join graph of a query produces a combined join graph that 
includes (1) a set of joins that are common to both the 
materialized view and the query, (2) a set of joins that are in 
the query but not the materialized view, and (3) a set of joins 
that are in the materialized view but not in the query. 

The set of joins that are common to both the materialized 
view and the query are referred to as "common section". 

The set of joins that are in the query but not the materi- 
alized view are referred to as the "query delta**. 

The set of joins that are in the materialized view but not 
in the query are referred to as the "materialized view delta". 

For example, assume that a query contains the join 
A><BxC->D and a materialized view definition contains 
the join AxBxE. Under these conditions, the join AxB 
is the common section, the joins BxC and C->D belong to 
the query delta, and the join BxE belongs to the material- 
ized view delta. 

A join in a materialized view definition is said to be 
"non-matching" relative to a query if the join belongs to the 
materialized view delta. For example, if a query contains the 
join AxBxC->D and a materialized view definition con- 
tains the join AxBxE, then the join BxE is a non- 
matching join relative to the query. 

A join is referred to as a "lossless" join when all rows in 
the child table are reflected in the join result. All outer joins 
are lossless joins by definition because all rows in the outer 
table of an outer join are always reflected in the result of the 
outer join. 

A materialized view can be thought of as a join between 
the common section and the materialized view delta. The 
join between the common section and the materialized view 
delta is "lossless" relative to the common section if each row 
that exists in the common section is reflected in at least one 
row of the materialized view. 

Ajoin is referred to as a "one-to-one" join when each row 
in the child table is reflected in no more than one row of the 
join result. One way to ensure that a join will be one-to-one 
is to impose a uniqueness constraint on the join column of 
the parent table. If all values in the join column of the parent 
table are unique, then no more than one row in the parent 
table will have a value in its join column that matches any 
given value in the join column of the child table. 
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A join is referred to as a "one-to-N" join when no 
. constraints prevent rows in the child table from combining 
with more than one row in the parent table. 

Ajoin is "one-to-one lossless" relative to the child table 
5 if each row in the child table is reproduced exactly once in 
the result of the join. If the join is an inner join, then the join 
between the child table and the parent table is one-to-one 
lossless if each row in the child table combines with exactly 
one row of the parent table. This does not prevent more than 
10 one row of the child table from joining with the same row 
in the parent table. 

Stated another way, a join is one-to-one lossless if either: 

(1) the join is an inner equijoin, and 

(2) the parent side join key is unique, and 

(3) the child side join key has no nulls, and 

(4) there is no child side join key value that is not also a 
parent side join key value (this condition will be 
satisfied if, for example, a referential integrity relation- 

20 ship exists between the child side join key and the 
parent side join key); 

or 

(1) the join is a left outer join, and 

(2) the parent side join key is unique. 

25 Ajoin is a "one-to-N lossless join" when each row of the 
child table is reflected in at least one row produced by the 
join. All conditions shown above for one-to-one lossless 
joins must be satisfied for a join to be a one-to-N lossless 
join, with the exception that the parent side join key does not 

30 have to be unique. 

QUERY PROCESSING OVERVIEW 

According to an embodiment of the invention, when 
accessing a materialized view will reduce the execution cost 
of a query, an execution plan that accesses the materialized 
view is used to execute the query, even though the original 
query does not make reference to the materialized view. 
FIGS. 2A, 2B and 2C contain a flowchart that illustrates the 
steps performed by a database server in response to receiv- 
ing a query, according to an embodiment of the invention. 

The database managed by the database server includes 
one or more materialized views. The set of all of the 
materialized views included in the database is referred to 

45 herein as the "complete set of materialized views". At step 
200, the complete set of materialized views is "pruned" to 
create a set of possible materialized views. The set of 
possible materialized views includes those materialized 
views that could possibly be used to process a received 

50 query. 

Various criteria may be used during this pruning process. 
For example, one possible pruning criteria may be that at 
least one of the tables referenced in the received query must 
be a base table of the materialized view. Based on this 

55 criteria, a materialized view that has base tables A, B, and C 
would qualify as a "possible materialized view** with respect 
to a query that requires a join between tables A and D. On 
the other hand, a materialized view that has base tables B, C 
and E would not qualify as a "possible materialized view" 

60 with respect to a query that requires a join between tables A 
and D. 

Steps 202 and 204 form a loop in which each materialized 
view in the set of possible materialized views is processed. 
During the processing of each materialized view, the data- 
65 base server determines whether the materialized view is 
actually eligible to be used in a rewrite of the received query, 
and if so, the relative benefit gained by using that particular 
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materialized view. Specifically, at step 202, it is determined 
whether any "possible materialized views" are left to be 
processed. If all possible materialized views have been 
processed, control passes to step 220. Otherwise, control 
passes to step 204. 



12 

MATERIALIZED VIEW ELIGIBILITY 



As mentioned above, the query rewrite process includes a 
phase (step 206) during which it is determined whether a 
particular materialized view is eligible to be used in the 



MM , ., , . .. _ . . rewrite ot a particular query. H(J. ID is a flowchartTthat 

At step 204, an jessed possible mater.al.zed vjew^n SS^SSS^TtSapSeSSacd during this phase of the 

selected. At step 206, it is determined whether the selected K to 
materialized view is eligible for use in rewriting the received 



query. If the selected materialized view is not found to be 
eligible, control passes to step 230, and the materialized 
view is removed from consideration. From step 230, control 
passes back to step 202. 

If the selected materialized view is found to be eligible at 
step 206, control passes to step 214. At step 214, a "query 
reduction factor" is compute d for t hp materia 



currently being processed. ln e_query reduction factor is a 
valUt! that uslTinatcs^tiow useful_it will be to access the 
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mate rialized view to proce ss the received query .. The higher 
the query reduction tactor, me greater the esjjp iated benefit 
o f using the materialized view to process the quer y. _j 

— — - According to one embodiment of the invention, the query 
reduction factor for a materialized view is the ratio of (1) the 
su m of the ca KlinaliliesjQL&atcjrin g relations in the query. 
that will be replaced by the material ized view to (2) the 
^af^jT^Iity t^P. ^^^g^vTp.w, When use of the mate- 
rialized~view requires a join back, then the query reduction 
factor is adjusted to take into account the overhead associ- 
ated with the join back. This is simply one example of how 
the query reduction factor may be determined. The present 
invention is not limited to any particular query reduction 

^ factor calculation. 

At step 216, the query reduction factor for the material-' 
ized view being processed is compared to the highest 
previously calculated query reduction factor (the query 
reduction factor of the "current best materialized view"). If 
the query reduction factor for the materialized view being 
processed is greater than the highest previously calculated 
query reduction factor, then the materialized view being 
processed becomes the new "current best materialized view" 
(step 218) and control passes back to step 202. Otherwise, 
the materialized view being processed is removed from 
consideration (step 230) and control passes back to step 202. 

After all possible materialized views have been 
processed, the current best materialized view will be the 
materialized view that has the highest query reduction factor 
of all materialized views that passed the eligibility tests. If 
at least one materialized view passed the eligibility tests, 
control passes from step 220 to step 221. Otherwise, control 
passes to step 228 and the execution plan associated with the 
original query is used to execute the query. 

At step 221, the received query is rewritten to access the 
current best materialized view. The specific steps involved in 
rewriting a query to access a materialized view shall be 
described in greater detail below. 

At step 222, an execution plan is generated for the 
rewritten query. At step 224, an execution plan is generated 
for the original query. At step 226, it is determined whether 
the estimated cost of executing the execution plan associated 
with the rewritten query is less than the estimated cost of 
executing the execution plan associated with the original 
query. If the estimated cost of executing the execution plan 
associated with the rewritten query is less than the estimated 
cost of executing the execution plan associated with the 
original query, then the execution plan associated with the 
rewritten query is used to execute the received query (step 
232). Otherwise, the execution plan associated with the 
original query is used to execute the received query (step 
229). 
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rewrite process when the materialized view at issue is a 
s ummary tabl e. 

Referring to FIG. 2D, during steps 286, 288, 290 and 292, 
a series of tests are performed to determine whether a 
selected summary table is eligible to be used in a rewrite of 
a received query. The tests include tests for join compat- 
ibility (step 286), grouping compatibility (step 288), aggre- 
gate computability (step 290), and data sufficiency (step 
292). 

When the materialized view is not a summary table, the 
grouping compatibility and aggregate computabibty steps 
are not applicable and therefore are skipped. 

As shall be described in greater detail hereafter, embodi- 
ments of the invention implement these tests differently than 
conventional database servers. As a result, summary tables 
that would otherwise have been excluded from consideration 
by conventional database servers may pass these tests in 
database servers that implement the present invention. 

JOIN COMPATIBILITY 

One of the eligibility tests performed by a database server 
to determine whether a materialized view can be used to 
process a received query is referred to as the join compat- 
ibility test. Conventionally, this test is performed by deter- 
mining whether all joins in a materialized view match 
exactly with joins in a query, with possibly zero or more 
ad ditional _ j oins in the query . I n other words. ~flie~ ioih 
compatibilit y test is passed if andj jnly^if^hjj oin graph o f 
t he materialized view_is a subset of the join graph of th e 
received query. For example, if a materialized view joins 
tatJIeTSTnd B, and the query requires a join of A, B and C, 
then the join compatibility test is passed. However, if the 
materialized view joins tables A, B and D, and the query 
requires a join of A, B and C, then the join compatibility test 
fails. 

According to one aspect of the present invention, a query 
rewrite mechanism is provided that performs a different join 
compatibility test than that 'performed by conventional query 
rewrite mechanisms. Specifically, consider the following 
four scenarios: 

(1) A materialized view MV contains a join between 
tables A and B. A query Q contains the same join as in 
MV, between tables A and B. Materialized view MV 
passes the conventional join compatibility test. 

(2) A materialized view MV contains a join between 
tables A and B. A query Q contains the same join as in 
M V between tables A and B, and anothe^oiabetween 
table^A-and-C«-Materialized view MV passes the 
conventional join compatibility test. 

(3) A materialized view MV contains a join between 
tables A and B, and janother join between tables A an d 
C. A query Q contains the same join as in M V between 
tables A and B. Materialized view MV fails the con- 
ventional join compatibility test. 

(4) A materialized view MV contains a join between 
tables A and B, and another join between tables A and 

C. A query Q contains the same join as in M V between 
tables A and B, and another join between tables A and 

D. Materialized view MV fails the conventional join 
compatibility test. 



08/12/2002, EAST Version: 1.03.0002 



5,9< 

13 

Id the third and fourth scenario, the materialized view M V 
fails the conventional join compatibility test because the 
materialized view M V contains a join that is not in the query 
(a "non-matching join"). According to one aspect of the 
invention, the join compatibility test is expanded so that the 
existence of a non-matching join in the materialized view 
does not automatically constitute failure of the join compat- 
ibility test. 

Specifically, when a materialized view contains one or 
more non-matching joins, the query rewrite mechanism 
makes a further determination of whether the join between 
the common section and the materialized view delta is 
lossless. If the join between the common section and the 
materialized view delta is lossless, then the materialized 
view passes the join compatibility test even though it 
contains non-matching joins. If the join between the com- 
mon section and the materialized view delta is not lossless, 
then the join compatibility test fails. 

One technique to determine whether the join between the 
common section and the materialized view delta is lossless 
is to check whether each of the non-matching joins is 
lossless. If each of the non-matching joins contained in 
materialized view definition is lossless, then the materialized 
view passes the join compatibility test in spite of the fact that 
the materialized view contains non-matching joins. 

FIG. 3 is a flow chart that illustrates the steps involved in 
determining whether a materialized view is join compatible 
to a query (step 206) according to an embodiment of the 
invention. At step 300, a join graph is generated for the 
query and a join graph is generated for the materialized view. 
In a join graph, vertices represent base tables and edges 
represent joins between the base tables. For example, 
assume that a materialized view MV is defined by the 
following query: 

QUERY2 

Select product__name , city, month, sum(dollarsales) as 

sumsales 
from Sales, Product, Region, Time 
where 

Sales.prod_key=Product.prod key 

Sales.region key=Region.region_key 

Sales.time_key==Time.time_key 

Group by product_name, city, month 

The join graph for th at materialized view would have the 
join graph illustrated in FIG. 5A. Assume that a database 
server wishes to know whether that materialized view is join 
compatible with a received query: 

QUERY3 

Select product_name, sum( dollarsales) 

from Sales, Product 

where 

Sales.prod„key-Product.prod_key 
Group by product__name 

The join graph for query3 is illustrated in FIG. 5B. After 
the join graphs have been generated for both the material- 
ized view and the received query, control passes to step 302. 
At step 302, it is determined whether the join graph of the 
materialized view is a subset of the join graph for the 
received query. If the join graph of the materialized view is 
a subset of the join graph for the received query, then control 
passes to step 310 and the materialized view passes the join 
compatibility test. 
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If the join graph of the materialized view is not a subset 
of the join graph for the received query, then the join graph 
for the materialized view contains at least one non-matching 
join relative to the query, and control passes to step 304. In 

5 the example given above, the materialized view based on 
query2 contains two joins that are non-matching relative to 
query3: a join between Sales and Region, and a join between 
Sales and Time. 

Steps 304 and 308 form a loop during which step 306 is 

10 performed for each non-matching join of the materialized 
view. At step 306, it is determined whether a non-matching 
join in the materialized view is lossless. If any of the 
non-matching joins in the materialized view are not lossless, 
then control passes to step 312 and the materialized view 

15 fails the join compatibility test. If all of the non-matching 
joins in the materialized view are lossless, then control 
passes to step 310 and the materialized view passes the join 
compatibility test. 

20 GRAPH TRAVERSAL 

Using the join compatibility test described above, every 
non-matching join in the materialized view is tested for 
losslessness. However, if one or more of the non-matching 
joins are outer joins, then the join between the common 

25 section and the materialized view delta may still be lossless 
even though some of the non-matching joins are not lossless. 

For example, assume that a defining query for a materi- 
alized view includes the join: A><B->C><D. Further 
assume that the query in question specifies the join AxB. 

30 In this case, the joins B->C and CxD are non-matching 
joins. B->C is lossless by definition, since B->C is an outer 
join. The join CxD is an inner join and therefore may not 
be lossless. However, under these conditions, the join 
between the common section A><B and the materialized 

35 view delta CxD is lossless even if the non-matching CxD 
is not lossless. This is because all of the rows of AxB will 
be preserved even if they do not match any rows of CxD, 
since the join between AxB and CxD is an outer join. 

4Q Referring to FIG. 6A, it illustrates the join tree of a 
relatively complex materialized view. FIG. 6B illustrates the 
join tree of a query. The query contains one join 602 that is 
not in the materialized view. The materialized view, on the 
other hand, contains six non-matching joins, illustrated in 
e FIG. 6C as joins 604, 606, 608, 610, 612 and 614. 

45 

In step 304 of the join compatibility test illustrated in FIG. 
3, an unprocessed non-matching join is selected. According 
to one embodiment, this selection is not made randomly. 
Rather, the join graph of the materialized view is traversed, 
50 processing each join as it is reached during the traversal 
process. 

In the example illustrated in FIG. 6C, the portion of the 
join graph that matches the query is identified to as common 
section 618. Each join that is connected to the common 

55 section 618 is a root of a subtree within the join graph. In the 
illustrated example, joins 610 and 612 are connected to 
common section 618. Join 610 is the root of a subtree that 
includes joins 610, 608, 606 and 604. Join 612 is the root of 
a subtree that includes joins 612 and 614. 

60 According to one embodiment, each subtree is traversed. 
However, during the traversal not every join is necessarily 
visited. Specifically, the traversal down a particular branch 
of a subtree is terminated if (1) the branch is not connected 
to any other unvisited branches, (2) a join that is not lossless 

65 is encountered, or (3) an outer join is encountered. In the 
case that a join that is not lossless is encountered, the 
traversal ends because the materialized view fails the join 
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compatibility test. In the case that an outer join is a join graph subtree by an outer join, then the materialized 

encountered, the traversal of the branch ends because the view, passes the join compatibility test because the materi- 

joins further down the branch will not have any effect on alized view contains at least one copy of all rows in the 

whether the materialized view is lossless relative to the join common section. However, additional benefits may be 

contained in the query. 5 gained if all non-matching joins are one-to-one. 

For example, the traversal of the materialized view may If the materialized view contains duplicate rows from the 

proceed as follows: common section, a query rewritten to access the materialized 

(1) join 610 is determined to be a lossless inner join view typically has to be rewritten in a way that requires an 

(2) join 608 is determined to be an outer join (traversal of 1fl actional step of removing duplicate common section rows 
the current branch is terminated) 10 from the matemhzed view Such duplicate removal may be 

,^ • - - . • , , , , - . . accomplished, for example, by performing a DISTINCT 

(3) join 612 is determined to be a lossless inner join operation QD a column in ^ materialized view that (1) 

(4) join 614 is determined to be a lossless inner join ex i s t s in the common section and (2) is subject to uniqueness 
At this point, all of the subtrees that stem from common and non-null constraints. When the materialized view is a 

section 618 have been processed. During the traversal, joins 15 summary table, more complicated techniques for removing 
604 and 606 were never processed because join 608 is an the effects of duplications may be necessary. Those tech- 
outer join. Consequently, it was not determined whether niques are described in detail hereafter, 
joins 604 and 606 are lossless However, since join 608 is an However, if alt non-matching joins of a materialized view 
outer join, whether joins 604 and 606 are lossless has no are one . to . one „ wcll as lossless , thcQ the materialized vicw 
effect on whether the jom between the common section and 20 win comain exactl one of fQW m me common 
the materialized view delta * lossless. S ^ QQ ^ ^ quefy Qeed QOt fequire a D[S _ 

TEST FOR LOSSLESSNESS TINCT operation. 

. . . t _ „ . t For example, assume that a database contains the mate- 
As mentioned above, a join is lossless if all rows in the rialized view ^ created by query2 ^ aQd that qucry3 ^ 

child table are reflected in the jom result. Consequently, one * received ReMve tQ query3j MV the non . matching 

way to determine whether a non-matching join of a mate- joins- 

rialized view is lossless would be to execute a query to . . 

verify that every row of the child table of the join is reflected a es re S lon — ev=a egion.region_ ey 

in the materialized view. Unfortunately, execution of such a Sales.time_key-Time.time_key 

query would typically involve too much overhead. In 30 If tDe database server determines that these joins are 

addition, such a query would only indicate whether the join one-to-one lossless, then query3 can be rewritten to access 

is currently lossless, not whether the join will always be MV as illustrated by query4: 



lossless. 

According to one aspect of the invention, the determina- 



QUERY4 



tion of whether a join is lossless is made based on metadata 35 Select product__name, sum(sumsales) 
associated with the tables involved in the join. The steps from MV 

used to determine whether a particular join is lossless are G . pr oduct_name 

illustrated in FIG. 4 Referring to FIG. 4, at step 400 it is ^ a secofld e k) that the database 

determined whether the jom isan outer join. If the join is an desifes tQ rewite (he fol]owin to access {he materi . 

outer join, the jom is (by definition) lossless, and control rimed view MV- 
passes to step 422. 

If the join in not an outer join, then control passes to step QUERY5 
402. At step 402, it is determined whether the child side join ^ , , „ t . 

key has any nulls. While this determination can be made by 45 Select P">duct_name, year, sum(dollarsales) 
scanning the values in the child side join key, it is preferably from Sales > Product, Time, Month 
performed by determining whether a non-null constraint has where 

been placed on the one or more columns that constitute the Sales.prod_key~Product.prod _Jcey 

child side join key. If the child side join key has nulls, then Sales.time_key~Time.time_key 

the join is not lossless and control passes to step 420. 50 T ime.month=Month.month 

If the child side join key does not have nulls, control Group by pro ducL_name, year 

passes to step 404 and it is determined whether any child Q U ery5 contains one join "Time.month=Month.month" 

side join key value is not also a parent side join key value. mat is not contained in the MV. On the other hand, the 
This step may be performed by determining whether there is materialized view MV contains one non-matching join 

a referential constraint on the child and parent side join keys 55 "Sales.region_Jcey=Region.region_key" relative to query5. 
that prevents the insertion of values that are not in the parent If the database server determines that this non-matching join 
side join key and that ensures uniqueness of the parent-side ^ one-to-one lossless, the query can be rewritten: 
join key. If any child side join key value is not also a parent 

side join key value, then control passes to step 420 where it QUERY6 

is determined that the join is not lossless. If all child side join 6 o ^ , , / , ^ 

key values are also parent side join key values, then control Scled P roduc " name - vear ' Wsumsales) 

passes to step 422 and it is determined that the join is from MV, Month 

lossless. where 



Month. month=MV month 
Group by product_name, year 
If all of the non-matching joins in a materialized view are Various techniques may be used to determine whether a 



LOSSLESS ONE-TO-ONE 
ion- matching joins in a mated: 

lossless, or if all those that are not lossless are preceded in lossless non-matching join is one-to-one lossless. According 
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to one embodiment, a lossless join is determined to be 
one-to-one lossless if the parent side join key is subject to a 
uniqueness constraint. Thus, once the join: 
Sales.region_key-Region.region_key 
has been determined to be lossless, the database server 
can determine whether Region.region_key is subject to 
a uniqueness constraint. If Region.region_Jcey is sub- 
ject to a uniqueness constraint, then the join is one-to- 
one lossless. 

SUMMARY TABLES WITH ONE-TO-MANY 
LOSSLESS JOINS 



10 



For example, assume that the database server receives the 
following query (Ql): 
select l.ofk, sum(l.m) 
from l,o 

where l.ofk=o.opk (+) 
group by l.ofk; 

Id query Ql, the common section is L->0 and the 
summary table delta is L->S. Because the join (L->0)->S 
between the common section and the summary table delta is 
one-to-many, some of the rows necessary to generate the 
sum(l.m) required by this query may have been duplicated 



, t . » .» . . <% ,i,in the summary table ST. The duplication occurs because the 

If the join between the common section and the maten- 1 ^ orje _ lo _ many join i. sf k- s .sp k (+) contained in the 

' is summary table definition may have introduced duplicates of 
some of the rows that exist in the common section (L->0). 



alized view delta is one-to-many, then child-side rows of the 
one-to-many join may be duplicated in the join result. When 
the materialized view contains no aggregation (Le. is not a 
summary table), then the effect of duplicate child-side rows 
can be eliminated through the use of a DISTINCT operation 
when the associated keys are exposed, or a surrogate such as 
the primary key or rowid, as described above. However, 
when the materialized view contains aggregation, then the 
aggregate values within the materialized view will have 
been computed based on all child-side rows that are reflected 
in the join result, including child-side rows that are dupli 
cated multiple times in the join result. Consequently, a query 
that is rewritten to access a summary table that has a 
non-matching one-to-many lossless join must be rewritten in 
such a way as to counteract the effect of child-side rows that 
are reflected multiple times in the aggregate values. 

Certain types of aggregate functions, such as MIN and 
MAX, are absolute in that they will produce the same result 
regardless of whether the set of values to which they are 
applied contains duplicates. However, other types of aggre- 
gate functions, such as COUNT and SUM, are cumulative 
and will therefore be affected by the existence of duplicate 
values in the set of values to which they are applied. The 
techniques described hereafter illustrate how, according to 
one embodiment of the invention, queries are rewritten to 
access a summary table when (1) the summary table defi 
nition and the query both call for a cumulative aggregate 
function, and (2) the join between the common section and 
the materialized view delta is one-to-many. — ' 

For example, consider the summary table ST definition 
(Dl): 

ST (l_ofk, Ljsfk, sum„l_m): 
select l.ofk l.sfk, sum(lm) 
from 1, o, s 

where l.ofk=o.opk (+) and l.sfk=s.spk (+) 
group by l.ofk, Lsfk; 

In this definition, a table L is outer joined with both a table 
O and a table S. The graph of this definition can be 
represented as S<-L->0. Because the joins L->S and L-»0 
are outer joins, they are guaranteed to be lossless. However, 
for the purpose of explanation, it shall be assumed that they 
are not one-to-one (i.e. neither o.opk nor s.spk are subject to 
a uniqueness constraint). 

Because o.opk and s.spk are allowed to have duplicates, 
the summary table ST cannot easily be used to rewrite 
queries that only join L and S or queries that only join Land 
O. For example, in the case of a query that joins L and S, 
summary table ST additionally joins (L->S) to O, and will 
introduce duplicates that are not present in the join between 
L and S. Thus, the set of rows that are used to generate the 
aggregate values in the summary table ST is larger than the 
set of rows that must be used to generate aggregate values 
required by the query. 
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According to one embodiment of the invention, the fact 
that the join between the common section and the summary 
table delta is one-to-many is compensated for by "scaling 
down" the aggregate values in the summary table. To 
determine the factor by which the aggregate values in the 
summary table must be scaled down, the database server 
generates for each child-side row a count of how many 
parent side rows match the child-side row. If the child-side 
row matches five parent-side rows, then the child-side row 
will be reproduced five times in the join result. 
Consequently, any aggregate values that are generated based 
on that join result will have aggregated each of the child-side 
row values five times. 

The count of how many parent-side rows a particular 
child-side row combines with may be calculated using the 
SQL expression: 

(count (distinct Child Jo mCol||ParenLRowIdentifier)) 

35 where Child JoinCol is the name of the child-side join 
column and Parent.RowIdentifier is the name of the mate- 
rialized view column that contains the row identifiers of the 
parent-side rows. Thus, the expression to count the number 
of times each of the child-side rows in the L->S join is 

40 duplicated in summary table ST would be (count (distinct 
l.sfk||s.rowid)). In the expression (count (distinct 

<A- X l.sfk||s.rowid)), Lsfk designates a single value of the l.sfk= 
s.spk predicate, and s.rowid designates all rows of S that join 
with it. 

The expression to count the number of times a child-side 
row is duplicated in a join becomes slightly more compli- 
cated when outer joins are involved, since a child-side row 
that matches zero parent side rows is reproduced once in the 
join result. Thus, the correct scaling factor for a child join 
column value that does not match any parent join column 
value is "1", even though the expression (count (distinct 
l.sfk||s.rowid)) would produce "0" in such cases. 

To force the expression to return a "1" when the child side 
join value does not match any parent side join values, the 
following expression may be used: 

count(distinct decode(Parent.RowIdentifier ) null, '1', 
Child JomCol||Parcnt.Rowldcntifier)) 
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STORED SCALE FACTORS 

The values returned by the count expression given above, 
referred to herein as scaling factors, can be generated at the 
time the summary table is generated, and stored in the 
summary table itself. For example, the definition of sum- 
mary table ST may be revised as follows: 

ST (l_ofk, I_sfk, o_per_l, s__per_l, sum_JL_m): 

select l.ofk, l.sfk, 
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couot(distinct decode(o.rowid, null, '1\ o.ofk || o.rowid)) during the aggregation. Hence, the resulting aggregate val- 

count(distinct decode(s.rowid, null, *1\ &sfk II s.rowid)) ues are not accurate. For example, the actual sum of sales for 

,/v stores 1, 2 and 3 are $12, $18 and $12, respectively. 

^ ' However, the values in the sumsales column of summary 

from °' s 5 table 904 for stores 1, 2, and 3 are $24, 54, and $12, 

where l.ofk=o.opk (+) and l.sfk=s.spk (+) respectively. 

group by l.ofk, l.sfk; FIG. 9C illustrates a summary table 906 that is generated 

ST has two additional columns when generated based on by the summary table definition (D2): 

this revised summary table definition: o__per_l and s_per_ ST sumsales? scalingfactor, city): 

1. For any given row in ST, the o_per__l column indicates in , 4 c 4 4 /c . i \ 

how many rows from table 0 combined with the Lofk for **** factstore ' ^m(fact.sales), 

that row. Similarly, for any given row in ST, the s_per_J count(distinct decode(detail.rowid, null, T, fact.store || 

column indicates how many rows from table S combined detail.rowid)) 

with the l.sfk for that row. detail.city 

Having stored scaling factors in this manner, query Ql 15 f rom fact, detail 

can be rewritten to access summary table ST as follows: where f acUtore -detaiLstDre (+) 

select l_ofk, sum(sum_l„m/s_per_l) group by fact store> detai i. cuy; 

from ST Summary table 906 reflects summary table 904 with the 
group by l_ofk; addition of a scalingfactor column. For each row in the 
Id this rewritten query, the references to tables L and O 20 summary table 906, the value in the scalingfactor column 
have been replaced with a reference to ST Because the indicates how many times each child-side row that was used 
L->0 join was already performed during the creation of ST, to create that summary table row was counted during the 
the rewritten query requires no join. In addition, the argu- aggregation. For example, row 910 has a scalingfactor value 
ment to the SUM function was changed from "l.m" to of 3. Thus, all child-side rows used to generate the sumsales 
"sum_l_m/s_per_r. In this example, the values in the 25 value in row 910 (i.e. $54) were counted three times. Stated 
sum_L_m column of each row of ST are divided by the another way, all child-side rows used to generate the sum- 
scaling factor s_per__l. The s_per_l scaling factor for a sa les value in row 910 combined with three rows in the 
given ST row is the number of rows from table S that parent-side table, and thus were each reproduced three times 
combined with the l.sfk value in that row of ST to generate in the outer join results that were aggregated to create row 
the sum_l_m value in that row. Dividing by this scaling 30 910. 

factor has the effect of scaling down the values from Once a summary table with scaling factors has been 

sum_l_m to compensate for any duplication of rows pro- generated, the summary table may be used to rewrite queries 

duced when L->0 was joined with S during the creation of that call for cumulative aggregates. In the rewritten queries, 

ST the argument to the cumulative aggregate function is the 

Referring to FIGS. 9A, 9B, and 9C, they are block 35 corresponding aggregate column of the summary table 

diagrams of tables that illustrate the use of scaling factors to divided by the appropriate scaling factor. For example, the 

rewrite queries in a way that allows the queries to access query: 

summary tables with one-to-many lossless joins. select fact.store, sum(fact.sales) 

Specifically, FIG. 9A illustrates two tables: fact 900 and f f 

detail 902. FIG. 9B illustrates a summary table 904 gener- 40 

ated based on fact 900 and detail 902 according to the group by fact.store; 

definition (D2)' can ^ e rewr ttten to access summary table 906 as follows: 
ST (store, sumsales, city): ^ ct st-storc, sum(sumsales/scalingfactor) 
select fact.store, sum(fact.sales), detail.city 45 fr° m ST 
from fact, detail group by st.store 
where fact.store=detail.store (+) 
group by fact.store, detail.city; 
The join contained in definition D2 is a one-to-many Using the technique described above, summary tables that 
lossless join. The one-to-many nature of the join is due to the 50 are created using one-to-many lossless joins can be used to 
fact the store column in detail 902, which is the parent-side rewrite queries by (1) storing scaling information within the 
join column, is not subject to a uniqueness constraint. Hence summary table, and (2) using the scaling information to 
the value "1" in the detail.store column appears twice, and sca i e down the cumulative aggregate values contained in the 
the value "2" in the detail.store column appears three times. summary table. According to an alternative embodiment of 
During the generation of summary table 904, the values 55 the invention, the scaling information need not be pre- 
fer the sumsales column are computed by summing the sales computed ahead of time. Rather, the query itself is rewritten 
values for all rows that are associated with each particular m such a way that execution of the query both produces the 
store value. However, each row from fact 900 is counted scaling information and uses the scaling information to scale 
every time it matches a row in detail 902. Thus, each row in down the cumulative aggregate values contained in the 
fact 900 that has a store value of "1" will be counted twice, 60 summary table. 



ON THE FLY SCALING 



For example, query Ql can be rewritten to access the 
summary table created based on definition Dl as follows: 



and each row in fact 900 that has a store value of "2" will 
be counted three times. The row in fact 900 that has a store 
value of "3" will be counted once, even though it does not 

match any row in detail 902, since the join used to generate 

the summary table 904 is an outer join. 65 sc i cct i_ 0 £k, sum(sum_Lm/5_pcrj) 

Because rows from fact 900 may be produced more than from (select Lofk, 1 sffc, sum_J_m, 

once during the join, they may be counted more than once 
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-continued 

count(distinct decode(s.rowid, null, T, sL_J_sEk \\ s.rowid)) 

as s_pcr_l 
from ST, s 

where st.l_s£k ■ s.spk (+) 

group by I oEk, l_sfk, sum_l_m)v 

group by L_ofk; 



The materialized view may contain more than one one- 
to-many lossless join that is not contained in a query. For 
example, consider the query Q2: 

select l.ofk, sum(l.m) 

from 1 

group by l.ofk; 

This type of query can be rewritten using the same 
technique illustrated above, but which generates on-the-fly 
scaling factors for each one-to-many lossless join that exists 
in the summary table but not in the original query. For 
example, the query Q2 may be rewritten as: 



select l_oQt, sum(sum_L_m/(o_peO • s_per_J)) 
from (select Lofk, l_sfk, sum__l_m } 

count(distiDct dccodc(s.rowid, null, T, sLl_sfk || s.rowid)) 
as s_per_J 

court t(distinct decode(o.rowid, null, T, sL_l_ofk j| o.rowid)) 

as o_per_l " 
from ST, s, o 

where sLl_sEk - s.spk (+) and sLL_ofk « o.opk (+) 
group by l_oEk, l_s£k, sum_l_m) v 
group by L_ofk; 



The type of join introduced in the rewritten queries 
matches the join type of the corresponding join in the 
summary table definition. For example, if the join between 
L and S in the summary table was an inner join instead of 
an outer join, then the join introduced in the rewritten query 
would be the inner join stl_sfk=s.spk. 

The benefit of computing scaling factors on-the-fly is that 
no additional space is needed to store scaling factors within 
the summary tables, and existing summary tables that do not 
have scaling factors can also be used for rewrite. However, 
the additional joins required by the rewritten queries that 
perform on-the-fly scaling factor generation slows down 
query processing. 

DERIVABLE JOIN TYPES 

Within a join graph, each edge in the common section 
corresponds to both (1) a join in the materialized view 
between two tables and (2) a join in the query between the 
same two tables. For the purpose of explanation, the join in 
the query that is represented by an edge in the common 
section is said to correspond to the join in the materialized 
view that is represented by the same edge in the common 
section. 

There are many types of join operations that can be 
performed between two tables, including inner joins, outer 
joins, semi-joins and anti-joins. An anti join is an operation 
that produces each row in a child table that does not satisfy 
a join condition relative to any rows in a parent table. For 
example, an anti-join between Tl and T2 where the join 
condition is TLcoll=T2.col2 would produce each row of Tl 
that has a value in the coll column that does not match any 
value in the col2 column of T2. 

A semi join is an operation that produces each row in a 
child table that does satisfy a join condition relative to any 
rows in a parent table. For example, a semi-join between Tl 
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and T2 where the join condition is Tl.coll=T2.col2 would 
produce each row of Tl that has a value in the. coll column 
that does match any value in the col2 column of T2. Each 
row of Tl that satisfies the join condition relative to any 
5 rows in the parent table is produced only once, even though 
it may satisfy the join condition relative to many rows in the 
parent table. 

In the previous discussion, it was assumed that each join 
in a query would be the same type of join as its correspond- 

1° ing join in the materialized view. For example, in FIG. 6C 
the edge within common section 618 represents a join 
between the Product and Sales tables. The edge is in the 
common section because both the query and the materialized 
view contain joins between the Product and Sales. It was 

15 previously assumed that the join between the Product and 
Sales table in the materialized view was the same type of 
join as the join between the Product and Sales table in the 
query. 

According to one embodiment of the invention, under 

20 certain conditions the join compatibility test is satisfied even 
when the join operation in the materialized view is for a 
different type of join than the corresponding join in the 
query. Specifically, when the join type is not identical, the 
query rewrite mechanism determines whether the join type 

25 of the join in the query is "derivable" from the join type of 
the corresponding join in the materialized view. If a mate- 
rialized view otherwise passes the join compatibility test and 
the join type of the join in the query is derivable from the 
join type of the corresponding join in the materialized view, 

30 then the materialized view passes the join compatibility test. 
However, if the materialized view otherwise passes the join 
compatibility test, but the join type of the join in the query 
is not derivable from the join type of the corresponding join 
in the materialized view, then the materialized view fails the 

35 join compatibility test. 

According to one embodiment of the invention, the query 
rewrite mechanism determines that a join type in a query is 
derivable from the corresponding join type in the material- 
ized view if: 

40 

(a) the join type of the join in the materialized view is an 
outer join and the join type of the join in the query is: 

(1) an outer join, 

(2) an inner join, 
45 (3) a semi join, or 

(4) an anti join; or 

(b) the join type of the join in the materialized view is an 
inner join and the join type of the join in the query is: 
(1) an inner join, or 

50 (2) a semi join. 

JOIN DERIVABILITY FOR MATERIALIZED 
AGGREGATE VIEWS 

The scope of rewrite with Materialized Aggregate Views 
55 (summary tables) can be increased by considering join 
derivability. As mentioned above, some joins can be derived 
from other joins. For example, an inner join SxT can be 
derived from the outer join S->T if the latter contains 
anti-join markers. Anti -join markers, in this context, are 
60 values that make it possible to determine, for each row in the 
join result, whether or not the row was produced by com- 
bining rows of the child and the parent row. 

For example, the rowids of a parent table may be used as 
anti-join markers. If the result of an outer join includes a 
65 column (parent_rowid) for the rowids of the parent table, 
then the rows in the join result that have a NULL in the 
parent_rowid column did not join with any parent table 
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rows. Thus, an inner join between the same two tables using Q2: 

the same join condition would produce all of the rows in the select t.y, sum(t.m) 

outer join result that do not have a NULL in the parent_ from t, s 

rowid column. where s.x-t.y(+) 

Given an outer join S->T the following joins can be 5 group by t.y 
derived: lnen tne rewrite would involve a rollup because the child 

• • c t* c T ♦ • - • table could have multiple rows with different values of s.x 

inner join SxT. S->T must contain anti-join markers . A ... A , . r, . , it L1 ™, 

,. J , , c „ l4 . „ . . . that did not combine with any rows in the parent table. Inus, 

which are used for filtering out anh-join rows. ^ ^ ^ ^ chM ^ would pf J uce rQWS 

semi-join S>-T where S is left side of the semi-join. S->T 10 m where t.y is a null value: 

must contain anti-join markers and the join key(s) of S. q2>. 

The latter are used for removing duplicate rows of S. t _ y> sum(sum _ t _ m) 

anti-join S|-T where S is the left side of the anti-join. £ rom gj^ 

S->T must contain anti-join markers which are used for group by t 

filtering out non anti-join rows. 15 Thus a summary table that contains an outer join S->T 
Similarly semi-join S>-T is derived from an inner join and b join k of botQ ta51es can be uscd to rewrite 

SxT provided that SxT contains the join key of S. ries ^ matchin QUter ^ ^ ^ S><J which 

Join denvabihty can be used for rewriting a query that J* . & . . k & 
does not reference a summary table so that the rewritten Iq ma ^ {hQ . k fe Qot ^ join k Iq 

query accesses the summary table. Specifically if a sum- 20 ^ anti _ join markers m the sum ta5le m ^ 
mary table contains an outer join, U can be used, provided to determine whether a row in the by came from the 
auxiliary information exists, to rewrite a query containing mner • {m Qr ^ anti _ join£ , ^ allows mner joins t0 be 
matching outer mner semi, or anti-joins. deriyed from QUter join {q a sum table> 

Taking jom denvabmty mto account when determining p 0f exam «i e . 

how a query may be rewritten is important because it 25 
increases the number of queries that can be covered with a 

summary table. In addition, by accounting for join ST2(l_c, o_c, o_oj_fiag > s_c, s_oj_flag, sunuLm): 

derivability, outer joins may be included in summary tables select u, o.c, decode(o.rcwid, null, 4 a\ T), s.c, deoode(s.rowid ) 

without significantly reducing the number of queries that can nu j^ '» '0^um(l.m) aa sum_j_m 

be serviced by the summary table. Further, the use of outer 30 wh ^\ofk = o.o P k( + ) aod i.sfk - s. sp k(+) 

joins is a convenient mechanism for insuring that a join is gJoup by i. c , o.c, decode(o.rowid, null, 'a*, *p), s.c, decode(s.rowid, 

lossless without expensive referential integrity foreign- null, 'a', 'i'); 

primary key maintenance. 

x™,^ T4niTT , n T ^ TXT ™, This summary table contains outer joins S<-L->0. It 

OUTER '^^^^'^^ » joins by Lof k^opk( + ) and l.sfk=s.spk( + J ) and groups by Lc, 
WITH SUMMARY TABLES o.c, s.c. The summary table includes anti-join markers 

The following examples illustrate how a query that sped- decode(o .rowid, null, 'a', V) which mark with V and <i' 
fies an inner join between two tables can be rewritten to anti-join rows and inner join rows of l^>0 respectively, 
access a summary table that specifies an outer join between Similar markers are generated by the statement "decode 

the same two tables. Consider a summary table with an outer 40 (s.rowid, null, 'a', , i , ) M to distinguish between the anti-join 
join S->T on S.x-T.y which groups by both join keys: rows and the inner join rows produced by the join L->S. 

grpj. Using these markers, queries which contain matching 

(s x, t y, sum t m, sum s m) select s.x, t.y, sum ™ a ™ i oins can be rewritten. For example, the following 

it lrJ im 7/™\ query Q3 with L><0: 

(t.m), sum(s.m) 45 h j w 

, . select l.c as l_c, o.c as o_c, sum(l.m) as sum_J_m 

where s.x=t.y( + ) from ^ o 

group by s x t.y where l.ofk-o.opk 

Suppose the database server receives a query Ql which , . 

contains inner join SxT and which groups by the right side 50 group y -c, o.c, 

can De rewritten as 

Q3': 



of the join (i.e., by t.y). 

select l__c, o„c, sum(sum_l_m) 

select t.y, sum(t.m) from ST2 

from l » s 55 where o_oj_Jag-*i' 

where s.x=Ly group by j__ c> 0 _ c; 

group by t.y The ability to rewrite queries to access summary tables 

Ql can be rewritten by filtering from the summary table containing outer joins is important, particularly because 

ST1 the anti-join rows. In this case those are rows where outer joins are, by definition, lossless with respect to the 

s_x is null: 60 child table. Consequently, expensive referential integrity 

Ql' : foreign-primary key constraints do not have to be main- 

t . „ „ im t m tained to enable the rewrite, 

select t_y, sum t m 

from SH OUTER/INNER JOIN TO SEMI JOIN 

where s_x is not null; 65 DERIVABILITY WITH SUMMARY TABLES 

If the query contained outer join S->T and the results Asummary table can be used to satisfy not only a join but 

were grouped by the right side of the join, (i.e., by t.y): also a semi-join. Consider this query: 
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Q4: 

select o.c, sum(o.m) 
from o 

where o.opk in (select Lofk from 1, s where Lsfk=s.spk 

and s.spk='sl') 
group by o.c; 

It can be answered by a summary table containing join 
0><L on o.opk«Lofk that groups (among others) by o.c. For 
example, consider this summary table that joins SxLxO: 

ST3 ofk, l_sfk, o_c, sum_o_m, count__o_c): 

select l.ofk, Lsfk, o.c, sum(o.m), count(o.c) 

from 1, o, s 

where l.ofk-o.opk and Lsfk-s.spk 
group by o.c, Lofk, Lsfk; 

Assume that joins L><0 and L><S are one-to-one loss- 
less. This summary table, in addition to aggregate sum(o.m), 
has count(o.c) which is necessary for the rewrite. 

Query Q4 can be answered as: 

Q4': 

select o_c, sum(sum_o_m/count__o__c) 
from ST3 
where l_sfk-'sr 
group by o_c; 

The semantic of the semi-join has been simulated using 
(sum_o__m/count_o_c). The count__o__c serves as the 
scaling factor for the semi-join. Thus, for the rewrite of a 
query containing a semi-join using a summary table with 
inner or outer join, additional scaling columns in the sum- 
mary table are needed. The additional scaling columns have 
to store the count of the group by columns, e.g., in ST3 
count(o.c) was stored. 

If ST3 contained an outer join, L->0 and corresponding 
anti-join markers like in the following ST4: 

ST4 (l_ofk, Lsfk, o_c, o_oj_flag, sum__o_m, count„ 
o_c): 

select l.ofk, Lsfk, o.c, decode(o.rowid, null, 'a', *i% sum 

(o.m) count(o.c) 
from 1, o, s 

where Lofk=o.opk and l.sfk=s.spk 
group by o.c, decode(o.rowid, null, 'a', *i% Lofk, l.sfk; 
then the rewrite of Q4 would filter out the anti-join group 
by rows as follows: 
Q4»: 

select o_c, sum(sum_o_m/counL_o__c) 
from ST4 

where l_sfk=sr and o_oj_flag='i , 
group by o_c; 

OUTER JOIN TO ANTI JOIN DERIVABILITY 
WITH SUMMARY TABLES 

Assume a query with an anti-join is defined as follows: 



Q5: select o.c, sum(o.m) from o 
where o.opk not in (select Lofk from 1, s 

where Lsfk = s.spk) 
group by o.c; 



Suppose a summary table ST5 includes an outer/join o->l 
as follows: 
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ST5 (o__c, o„opk, s_spk, l_oj_mark, sum o_m): 

select o.c, o.opk, s.spk, decodeO-rowid, NULL, V, T), sum(o.m) 
from o, 1, s 

5 where o.opk - Lofk (+) and s.spk = Lsfk 

group by o.c, o.opk, B.spk, decode(l.rowid, NULL, 'a', T); 



Assuming that Lofk is subject to a NOT NULL constraint, 
Q5 can be rewritten in terms of summary table ST5 as: 



Q5': select o_c, s urn (s um o__m) from ST5 

where (l_oj__mark = 'a* and o__opk is not null) 
group by o_c; 

15 

The predicate "l_oj_mark=V" selects only anti-join 
rows from ST5, and the predicate "o_opk is not null" 
excludes those anti-join rows that have null in o„opk. The 
exclusion of anti-join rows with null in o__opk by the 

20 rewritten query is necessary because the original query Q5 
would not have produced such rows in its result. 

If the query had included a condition that restricted the 
result of the join lxs then its result will also contain some 
inner join rows in addition to the anti-join rows. For 

25 example, consider the following query: 



Q6: select o.c, sum (o.m) from o 

where o.opk not in (select l.ofk from 1, s 

where l.sfk = s.spk and sjspk = 'si') 
group by o.c; 



The result of this query includes anti-join rows as well as 
some rows from o that have o.opk value matching with l.ofk 
35 value in 1 rows. This means the rewritten query must select 
some inner join rows in addition to anti-join rows from the 
summary table. Because inner join rows in a summary table 
could be duplicated, it is necessary to also maintain scale 
factor for each grouping column as shown below: 

40 



ST6 (o_c, o_opk, s_spk, l_oj_mark, scale_o_c, sum_o_m): 
select o.c, o.opk, s.spk, decode(l.rowid, NULL, 'a', *i*), count(o.c), 

sum(o.m) 
from o, 1, s 

45 where o.opk = Lofk (+) and s.spk = Lsfk 

group by o.c, o.opk, sjspk, decode(Lrowid, NULL, 'a', 'i'); 



Now Q6 can be rewritten in terms of ST6 as: 

50 

Q6': select o_c, sum(sum o_m/scale_o_c) from ST5 

where (t_oj_mark - V and o_opk is not null) or 
(l_oj_mark = *i' and not (s_spk = 'si')) 

group by o_c; 



In the rewritten query Q6' in addition to the anti -joins, 
only those inner join rows from ST5 are selected for which 
the condition s_spk='sr does not hold. Also, the aggregate 
value associated with such innerjoin rows is scaled down to 
60 remove the effect of duplication due to the join between o 
and 1 in ST6. 

EMPIRICALLY DERIVED CONSTRAINTS 

Typically, database users explicitly specify the constraints 
65 on the columns of the tables that they create. However, 
according to one embodiment of the invention, the database 
server stores data that indicates whether a constraint is 
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satisfied by a column even though the user has not specifi- The possible existence of null values must be taken into 

cally declared the constraint on the column. account when determining whether, a function can be 

For example, at the time a materialized view is created, derived from other functions. For example, it may initially 

the database server may track whether all values in a appear that the function SUM(A+B) can be derived from the 

particular column are unique. If all values are unique, the 5 functions SUM(A) and SUM(B). However, if it is possible 

database server may save metadata for the table that indi- for Aor B to have null values, SUM(A+B) cannot be derived 

cates that the particular column satisfies a uniqueness con- from SUM(A) and SUM(B). For example, assume that only 

straint. Whenever the materialized view is subsequently the following three rows exist: <5, Null>, <Null, 10>, <3,4>, 

changed (e.g. during an incremental refresh operation), the where the first value is from the A column and the second 

changes are inspected to determine whether the uniqueness iq value is from the B column. In this case, SUM(A+B) would 

constraint is still satisfied. If any change violates the unique- yield SUM(Null, Null, 7)=7. This is because a Null added to 

ness constraint, then the metadata is revised so that it no any number produces a Null. In contrast, SUM(A)+SUM(B) 

longer indicates that the particular column is unique. would yield SUM(5, Null, 3)+SUM(Null, 10, 4)=8+14=22. 

The technique of empirically identifying undeclared con- 
straints may be used for numerous types of constraints, 15 ARGUMENT EQUIVALENCE 
including uniqueness constraints, non-null constraints, and For the purposes of explanation, an aggregate function in 
foreign key constraints. Significantly, these are the very a query ^ said to "correspond to" the one or more aggregate 
constraints that may be used to determine whether a join is functions in the summary table definition from which it may 
lossless and/or one-to-one. Thus, according to one embodi- be d er i V ed. For example, assume that a query has the 
ment of the invention, the metadata that indicates empiri- 20 aggregate functions SUM(A) and AVG(B), and that a sum- 
cally derived constraints is consulted in combination with mary table definition has the a gg rega te functions SUM(A), 
expressly declared constraint information to determine SUM(B), COUNT(B). Under these conditions, the SUM(A) 
whether a particular join is lossless and/or one-to-one. function in the query would correspond to the SUM(A) 

Similar to empirically identifying undeclared constraints, function in the summary table definition. The AVG(B) 

the database system may empirically identify one-to-one and 25 function in the query would correspond to the SUM(B) and 

lossless characteristics of joins within a materialized view. COUNT(B) functions in the summary table definition. 

Each join in the materialized view may have a correspond- witfa respect tQ arguments? aggregate computability 

ing lossless flag and a one-tonone flag. The lossless flag for requircs that argumcnts for the aggregate functions in the 

a join would be set if, during the mitial build of the que ry be "equivalent to" the arguments for the correspond- 

matenalized view, all rows m the child table join with at 30 aggre gate functions within the summary table definition, 

least one row of the parent table. The one-to-one flag would . ^ ^ umGni is equivalent to another argument when it 

be set if all rows in the child table join with exactly one row necessarily produces the same result. For example, the 

of the parent table. During refreshes of the materialized summary table definition may specify SUM(A+B), while the 

view, the flags would be altered appropriately based on specifies SUM(B+A). In this case, the arguments to 

whether any rows of the child table were lost or duplicated 35 the SUM tumSdon afe nQt identical ^ but are Equivalent", 

during the join. Referring to FIG. 7, it is a flowchart that illustrates steps 

AGGREGATE COMPUTABILITY for determining whether a materialized view satisfies the 

When determining whether a materialized aggregate view aggregate computability test relative to a query. At step 700, 

(i.e. a summary table) that is not referenced in a query can 40 an unprocessed aggregate function from the query is 

be used to process the query, an "aggregate computability" selected. For the purpose of explanation, it shall be assumed 

test must be performed. In SQL, an aggregate represents a that the aggregate function AVG(A+B) is selected at step 

function with a single argument. For example, SUM(E) is an ^00. 

aggregate with SUM being a function and E being the At step 702, it is determined whether the selected aggre- 

argument to the function SUM. E can be, for example, a 45 gate function can be derived from one or more aggregate 

simple column or an arbitrary arithmetic expression. SUM functions contained in the materialized view. For the pur- 

(A) sums up all values of column A. SUM(A+B) sums up all pose of explanation, it shall be assumed that the materialized 

computed values of the expression A+B, where A and B are view contains the aggregate functions AVG(A), AVG(B), 

simple columns. COUNT(B+A) and SUM(A). The aggregate function AVG 

The issue of aggregate computability arises from the fact 50 ^ derivable from COUNT and SUM, so control passes to 

that it may be possible to use a summary table to process a ste P 704. 

query even though the aggregate functions and the argu- At step 704, it is determined whether the argument in the 

ments to the aggregation functions in the query are not selected aggregate function is equivalent to the arguments of 

identical to the aggregate functions and the arguments to the the one or more corresponding aggregate functions in the 

aggregate functions in the summary table. 55 materialized view. In the present example, the argument in 

With respect to aggregate functions, aggregate comput- the selected aggregate function is A+B. The arguments in the 

ability requires that the aggregate functions within a query corresponding aggregate functions COUNT and SUM are 

be "derivable" from the aggregate functions in the summary B+A and A, respectively. The argument B+A is equivalent to 

table definition. An aggregate function is derivable from one A+B, but the argument A is not. Therefore, control passes to 

or more other aggregate functions if the results of the 60 ste P 710 aod il is determined that the materialized view is 

aggregate function can be computed from the results of the not aggregate computable relative to the query, 

one or more other aggregate functions. For example, AVG Steps 700 and 706 form a loop in which each aggregate 

(E) can be computed from SUM(E) and COUNT(E) using function in the query is processed. If any aggregate function 

the expression SUM(E)/COUNT(E). Similarly, STDDEV in the query cannot be derived from the aggregate functions 

(E) can be computed from VARIANCE(E) using the expres- 65 in the materialized view, or if any aggregate function in the 

sion SQRT(VARIANCE(E)) where SQRT is a function query has an argument that is not equivalent to the corre- 

which finds the square root of its argument. sponding aggregate functions in the materialized view, then 
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control will pass to step 710 where the materialized view is conform to a particular database language, such as SQL. The 
held to be not aggregate computable. On the other hand, if database language will dictate the format of the arguments 
all aggregate functions in the query can be derived from of the aggregate functions in the statements. Typically, 
aggregate functions in the materialized view, and all aggre- database languages specify a format that makes the argu- 
gate functions in the query have arguments that are equiva- 5 ments understandable to humans. For example, SQL uses 
lent to the corresponding aggregate functions in the mate- words such as "SUM( )" to specify function identifiers, 
rialized view, then control will pass to step 708 where the Similarly, descriptive words such as "Product.product_id__ 
materialized view is held to be aggregate computable, key" may be used as column identifiers. In addition, con- 
It should be noted that the order of the steps illustrated in ventional symbols such as "+" and may be used as 
FIG. 7 may be altered to achieve the same result. io operators. 

Specifically, for any given aggregate function in the query, While this notation is convenient for humans, it is unnec- 

the database server may first proceed to identify all aggre- essarily verbose as far as the database server is concerned, 

gate functions in the summary table definition that have Therefore, according to one embodiment of the invention, a 

equivalent arguments. Once the set of aggregate functions preliminary step in the transformation of arguments to a 

that have equivalent arguments has been determined, the is canonical form involves replacing SQL identifiers with more 

database server may determine whether the selected aggre- compact identifiers that have the same meaning. For 

gate function of the query can be derived from one or more example, the internal unique identifier for a particular col- 

of the aggregate functions in the summary table definition umn may be a two-byte column-in-table identifier combined 

that have equivalent arguments. with a four-byte table identifier, while the SQL identifier for 

DETERMINING ARGUMENT EQUIVALENCE 20 * e column is the text string "Product.product„id 

key \ The internal representation of that text string would 

The computational complexity to determine equivalence ^ significantly m0 re space than six bytes. Therefore, as 

between two arbitrary expressions is infinite. However, the part of the transformation the text string is replaced with the 

complexity is significantly reduced by adopting techniques s ^ va ] ue 

thatworkwitheachsingleexpressionseparately.ratherthan 25 Conventional « canonica i» forms of expression, such as 

working with two expressions at the same tune. According kwcs& ^ ^ Qo( dictate ^ order of ^ 

to one aspect of the mvention equivalence of arguments is a co[ ^ mutativ6 operation . For exampl6 , addition * com . 

established by comparing canonical forms of the Both +Afi ^ +BA ^ y&M Kveix ^ 

arguments, rather than by attempting to directly compare representations of |he 6xpressions A+B aad B+ A. Even 

two expressions. 30 th oug h these expressions are logically equivalent, they may 

According to one aspect of the invention, techniques are be transformed into valid reve(se poUsh represen , a tioiis that 

provided for rewriting the arguments of aggregate functions. do Q0( matcn 

Specifically, arguments are transformed in a way that maps embodim6nts of ^ invention use 

equivalent arguments to the same transfomed argument. techni that Mal expr6Ssions 

These transformed arguments thus produced can be com- 35 £ transformed expression. To 

pared to determine, for example, whether the aggregation v & , , c * u - _* n. 

F , - j • y 9 . , ft TiL f a accomplish this, the transformation techniques sort the oper- 

arguments contained in a query are equivalent to the aggre- ^ P commu ' tatjve ations . ^ A+B and B+A both 

gation arguments specified in the corresponding aggregate +AB, rather than +AB and +BA. 

functions in the definition of a summary table. . . 

During the transformation, arithmetic transformation 40 u Such .a commutative transformation can be accomplished 
rules, such as associativity, commutativity and distributivity, bv the operands of commutative operations accord- 
are used to reduce the argument of each aggregate function »* t0 any aribrary sort key, provided, that the resulting sort 
within the summary table definition to a canonical form. ■ deterministic (reproducible). According to one 

... . ,- . ,. , „ f e embodiment of the invention, the internal numeric ldentifi- 

Accordine to one embodiment, the transformation ot the . . 

. c.u . ui a »„ r-„ . „, „ c ers of the terms of the expression are sorted in ascending 

arguments of the summary table definition are performed at 45 " r 0 

the time the summary table is created. Once generated, the or er 

transformed arguments are persistently stored as metadata Conventional "canonical" forms of expression, such as 

associated with the summary table. reverse P olish . unnecessarily distinguish associatively 

Whenaqueryi S received,theargumentofeachaggregate equivalent representations. For example (A+B)+C is 

function in the query is also transformed into the canonical 50 equivalent to . A+(B+C). The reverse polish form of these 

form. The transformed versions of the query's arguments are expressions is AB+C+ and ABC++, respectively. While 

compared with the transformed versions of the summary algebraically equivalent the conventional canonical 

table definition's arguments. If the transformed arguments f ° rms of ^presentation differ. Accordmg .to one embodiment 

from the query match the transformed arguments of corre- ° f the invention, multiple instances of associative N-ary 

spending aggregate functions in the summary table 55 operators, where N is the number of operands for he 

definition, then the arguments are equivalent. "P™** can «» combined into a angle Nm-ary operator, 

x , . A . - . j . c „^ where Nm is the sum of the count of operands lor all the 

Various techniques may be used to perform the compan- f , r „ , t , 

^ ? c * : • t . a *u associative operators to be transformed. For example, the 

son between the transformed arguments in the query and the *■ . — _ _ „u™ ■ v 1 

r , & , . ♦ ui reverse polish forms AB+C+ and ABC+, where + is a binary 

corresponding transformed arguments in the summary table r , , , . . - , nn ' 

i_ j- * • i *«n addition operator, could both be transformed to a common 

definition. Accordmg to one embodiment, equivalence is w f. AT5 ^ a . ,. f , _ 

j j . P . . . f ' , ^ representation of ABC+3, where the +3 form represents an 

determined by performmg a byte for byte comparison j ' r 

. 4t _ -if f.u ♦ addition operator with three operands, 

between the canonical forms of the arguments. v . t * 

Conventional "canonical'* forms of expression, such as 

TRANSFORMING ARGUMENTS TO A reverse polish, unnecessarily distinguish distributatively 

CANONICAL FORM 65 equivalent representations. For example, A*(B+C) is 

Typically, both summary table definitions and user que- equivalent to (A*B)+(A*C). The reverse polish for these 

ries are submitted to the database server as statements that expressions is ABC+* and AB*AC*+, respectively. The 
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former representation is referred to herein as the "undistrib- example, each possible value for a month column corre- 

uted" representation, and the latter representation is referred sponds to a particular set of values in a day column, 

to herein as the "distributed" representation. While algebra- Therefore the month column is functionally dependant on 

ically equivalent, the conventional "canonical" forms of the day column. Similarly, each value in a year column 

representation differ. According to one embodimenet of the 5 would correspond to a particular set of yahies m a month 

invention, each expression or sub-expression that can be column, and to a particular set of values ,n toe day column. 

distributed is distributed. For example, A*(B + Q is trans- ?^ ucntl * the f 41 colu H m ° *^ C ^* de P endant oa 

c j . /A.*r>\./A*n\ a j;„„ »„ \„ «i,. ra ..;„. both the month column and the day column. 

formed to (A*B)+ A*Q According to an alternative functional dependency may 

embodiment, each distributed expression or sub expression be 0ne ., o . 0ne For ^ £ row ^ a 

is undistributed. For example, (A*B)+{B*C) * transformed 10 ^ had uni Qon _ nuU duct numbcr ^ rod _ 

toA*(B+C). 

key) and a unique, non-mill product name (prodname). 

Conventional "canonical" forms of expression, such as Under these conditions, the prod_key and prodname col- 
reverse polish, unnecessariliy distinguish many other umns are functionally dependent on each other, 
equivalent representations. For example, A+0 is euqivalent Consider a query that includes "SUM(sales) group by 

to A. The reverse polish for these expressions is A'O'+and A, 15 month". A summary table that includes "SUM(sales) group 
respectively. Another example isA+Aand A*2. the reverse by day" is grouping compatible with the query, since the 
polish for these expressions is AA+ and A2'*, respectively. sums of sales for each month can be derived from the sums 
Many other algebraically equivalent possibilities exist. of sales for each day. In contrast, a summary table that 

When dealing with environments that support the concept includes "SUM(sales) group by year" is not grouping com- 
of nullness, such as SQL language expressions, care must be 20 patible with the query, since the sums of sales for each 
taken with respect to identifying equivalent expressions. For month cannot be derived from the sums of sales for each 
example , A* 0 is algebraically equivalent to 0. However, A* 0 Y ear 

is not equivalent to 0 in SQL languages unless it can be Queries may group based on more than one column. For 
proven that A can never be null. A might be proven to be example, a query may specify "SUM(sales) group by region, 
non-null, for example, through exhaustive examination of 25 month". This query requires a separate value for each 
values in A, or through analysis of constraint information in region/month combination. A summary table would not be 
the metadata for A. The reason for potential non-equivalence grouping compatible with this query if it only included 
is that A*0=null value if Aisnull, and A*(M) if Ais not null. *'SUM(sales) group by month", since such a summary table 
According to an embodiment of the invention, in environ- would only contain a single value per month for all regions, 
ments that support nullness, certain algebraic transforma- 30 On the other hand, a summary table would be grouping 
tions are made dependent upon the provability of non- compatible with this query if it included "SUM(sales) group 
nullness of certain terms of the expression or sub- by day, location, product" (assuming that "month" is func- 
expression. In environments that do not support nullness, tionally dependant on "day" and "region" is functionally^ 
those same algebraic transformations are made without such dependent on "location"). 3 
consideration. In all environments, these considerations 35 Aggregate rollup is required if there is at least one 
apply only to transformations that involve the elimination of grouping column in the query that is functionally dependent 
one or more terms that can assume a null value. on a grouping column in the summary table, or there is at 

least one column in the summary table that neither matches 
GROUPING COMPATIBILITY AND exactly with a grouping column in the query nor functionally 

AGGREGATE ROLLUP *o depends on some other grouping column in the summary 

When both a query and a materialized view contain teb ]f- . . 

grouping, a grouping compatibility test is performed to . Conversely, no aggregate rollup is required if each group- 
determine whethe" the materialized view is eligible to be ™* «hmm m the query either matches exacdy with a 

used in rewriting the query. According to one aspect of the ^ ffWB ™ *c summary table or is functionally 

u & . J J >* * * k., 45 dependent on some other grouping column in the query. 

invention, a "grouping compatibility" test is performed by £l" . „ • • . , u / 

K. lL i_ * i • *u », When no aggregate rollup is required, the group by clause 

determining whether each grouping column in the query "u^uua^ B y . . J * A K 

e . ther & & r & of the query may be dropped. Otherwise, it should be 

, . iL . • i * j retained in the rewritten query. 2 

(1) exactly matches a grouping column m the materialized por 9wp ^ . f a ^ spedfies « SUM(saks) from ' 

view ' 50 xi group by month, region" is rewritten to access a sum- 

(2) is functionally dependent on another grouping column mafy (ab[e ST lhat reflects «sUM(sales) from Tl group by 
in the query, or ^ay, re gj on ^ product", then the rewritten query would specify 

(3) is functionally dependent on a grouping column in the "SUM(sumsales) from ST group by month, region". When 
materialized view. this rewritten query is executed, the database server "rolls 

The grouping compatibility rule hinges on the fact that 55 up » tne more specific (liner) aggregate values in the sum- 
generalities can be derived from specifics, but specifics marv taD [ c t0 me more general (coarser) aggregate values 
cannot be derived from generalities. required by the query. _j 

Note that it is trivial to match a query with no grouping COMPLEX AGGREGATES 

to a materialized view with grouping. In this case the ROLL UP Or LOMrLbA AuuKtuA] co 

grouping compatibility is always satisfied because all 60 Many database systems provide support for rolling up 
grouped aggregate values in a materialized view can be certain types of aggregates, such as SUM, COUNT, AVG, 
rolled up into a single aggregate value. MIN and MAX. For example, it is well known that an 

A first column is functionally dependent on a second average can be derived by dividing a sum with a count, 
column when there is a one-to-N mapping of values from the However, according to one embodiment of the invention, a 

first column to values in the second column. Such functional 65 database system is provided that is configured to derive 
dependencies may exist between columns of the same table, certain types of aggregates that are not derivable in conven- 
or between columns that belong to different tables. For tional database systems. 
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Specifically, a technique is provided for determining a query that contains the column that has the "missing infor- 
variance (VAR) or standard deviation (STDDEV). for a mation" continues to be referenced in the rewritten query, 
target population from the VAR, SUM, and COUNT of a Specifically, the rewritten query includes a join clause that 
source population when the target population is functionally causes the materialized view to be "joined back" to an object 

dependent on the source population. For example, the vari- 5 that contains the required data that is missing from the 
ance for each combination of <year, state> can be computed materialized view. 

given the variance, count, and sum by <month, city, prod- For example, assume that a summary table ST is defined 
uct>. by Query7: 

VAR(x) is defined in terms of count(x), sum(x), and OUERY7 

sum(x*x). Sum(x*x) can be determined from the other three 10 

quantities when count(x)>l. When count(x) is 1, VAR(x) is select month, sum(sales) sum__sales 
defined to be zero and therefore sum(x*x) cannot be deter- f rom f act> t j me 

mined; however, in this case sum(x*x) is equal to sum(x) whefe fac V time _ kev= time.nme^ey group by month 
*sum(x) so therefore sum(x*x) can always be determined SJ thcreforc contains ^ columns; and sum _ 

from VAR(x), sum(x), and count(x). salcs ^ rQW in ST indudes a va]ue tha( identifics a 

Given that sum(x*x), sum(x), and count(x) are available particular month, and the sum of all sales that occurred in 
in the source population, sum(x*x), sum(x), and count(x) mat particular month (as recorded in the "fact" table), 
can be determined in the functionally dependent target Assume that the database server receives Query8: 
population by surnrning the sums and counts and recomput- 

ing the new VAR(x). 20 QUERY8 

Specifically, the formula for computing variance based on select year, sum(sales) 
sum and count is: from fact, time 

where fact.time_Jtey=time.time_Jcey 

cumt* 2 ) [sum(x)]2 25 group by year 

Variance = county) Query8 requires the generation of the sum of sales by 

count(jf) - 1 year. The sum of sales for each year can be derived from the 

sum of sales for each month contained in ST, if the database 
system knows to which year each month belongs. 

T^e formula for standard deviation based on variance is: 30 Specifically5 S T contains the sum of sales by month for each 

month in a particular year. By summing those monthly totals 
standard deviations V variance for all months in a given year, the total sum for that given 

year is generated. 

The month-to-year mapping must be known in order to 
To accommodate variance, tables with a variance column 35 exe cute query8 using values from ST. In addition, each row 
usually include a count column and a sum column. However, returned in response to query8 must have the form <year__ 
typically there is no column for sum(x*x). When rollup is S ales_jsum>, where year_val is a year identifier and 

performed, multiple rows are combined into one row of the S ales_sum is the sum of sales for that particular year, 
result. The sum(x*x) for a given row is computed based on Unfortunately, ST does not store year values. Consequently, 
the variance, sum and count for that row. Once sum(x*x) has 40 wn ite values in ST can be used to generate the yearly 
been determined, the row can be rolled up with other rows sales_sum values, a join back to the "time" table must be 
according to the formula: performed in order to establish the month-to-year mapping 

and to obtain the year values. 
sum([sum(x)] 2 ) For the purpose of explanation, the table that contains the 

sum($um( )) - sun ^ count ^ 45 values that are necessary to process the query, but that are 

ro up o variance- sumlcouni(jO] - 1 not in the materialized view, shall be referred to herein as the 

join-back table. The column of the join-back table that 
contains the values required to process the query is referred 
Once the variance is rolled up using this technique, the to as the desired column. To perform the join back, a column 
standard deviation may be derived simply by taking the 50 in the summary table must be joined back to a column in the 
square root of the variance. join-back table. The column in the join-back table to which 

DATA SUFFICIENCY AND JOIN BACK me ^ * * * ^ " ^ 

join-back key. 

The most efficient way to process a query may involve According to one aspect of the invention, two types of 

accessing a materialized view that does not contain all of the 55 join back operations are supported: simple join back and 

data required to process the query. For example, a query that complex join back. Which of the two types of join back 

requests yearly sales totals may be processed most efB- operations to use in a particular situation hinges on whether 

ciently by accessing a summary table that contains monthly the join key is unique in the join-back table. If the join key 

sales totals. If the summary table does not have a column is unique in the join-back table, a simple join back may be 

indicating the year to which each month belongs, then the 60 performed. If the join key is not known to be unique in the 

query cannot be processed exclusively based on the infor- join-back table, a complex join back is performed, 

mation contained in the summary table. Returning to the example given above, assume that the 

According to one embodiment, techniques are provided Time table has columns Month and Year. If the month 
for rewriting queries to retrieve data from a materialized column of the Time table is subject to a uniqueness 
view under these circumstances. However, because the 65 constraint, then query8 can be rewritten to access the sum- 
materialized view does not contain all of the data necessary mary table ST using a simple join back to the Time table as 
to process the query, the object referenced in the original follows: 
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select year, sum(sum_sales) 
from ST, time 

where ST.month=time.month 
group by year 

In Query 0 , the references to the Fact table in Query8 have 
been replaced with references to the summary table ST. In 
addition, the argument to the SUM function has been 
changed from the name of the column of Fact to be summed 
(i.e. sales), to the name of the column of ST to be summed 
(i.e. sum sales). 

On the other hand, assume that the Time table has 
columns Day, Month, Year. The Month values are not 
subject to a uniqueness constraint because the rows associ- 
ated with numerous different days may all specify the same 
month. Under these conditions, the query is rewritten to 
access the summary table using a complex join back to the 
Time table as follows: 

QUERY10 

select year, sum(sum_sales) 
from ST, (select distinct month, year from time) v 
where ST.month=v.month 25 
group by year 

Similar to Query 9, in Query 10 the references made by 
Query8 to the Fact table have been replaced with references 
to the summary table ST and the argument to the SUM 
function has been changed from the name of the column of 30 
Fact to be summed (i.e. sales) to the name of the column of 
ST to be summed (i.e. sum_sales). In addition, Query 10 
replaces the references to the join-back table (i.e. Time). 

Specifically, QuerylO replaces the reference to the join- 
back table in the "from" clause with a view definition. All 35 
subsequent references to the join-back table are replaced 
with references to the view. 

la this example, the view definition defines a view that 
contains one row for every distinct month value in the Time 
table. Within the view, the row for a given month includes 40 
the month identifier for the month and the year identifier for 
the year to which the month belongs. Because each row in 
the view is for a distinct month value, the values in the 
month column of the view are, by definition, unique. 
Therefore, no month will be double-counted when the 45 
monthly values in ST are combined to create the yearly sum 
of sales values. 

In the preceding examples, the desired column appeared 
in the group by clause of a received query. However the 
desired column may appear in many places within the 50 
received query. For example, consider a system in which a 
summary table is defined by query7, and the database server 
received queryll: 

- 55 

QUERYll 

select month, sum(salcs) 
from fact, time 

where fact.timc_Jccy = time.timc_Jccy and 

time, year > 1970 go 
group by month 



In this case, the time. year column is the desired column, 
not because the sales have to be summed by year, but 
because only the monthly sums for years greater than 1970 65 
are desired. In this case, queryll can be rewritten to access 
ST as illustrated by query 12: 



QUERY12 

select ST. mo nth, ST.sum_saIes 
5 from ST, (select distinct month from time) v 

where ST. mo nth - v.month and 
v.year > 1970 



Note that the GROUP BY clause in rewritten queryl2 is 
10 not present because of exact matching of grouping columns 
between queryll and MV (query 7). 

NORMALIZATION AND DENORMAL1ZATION 

Whether a join-back key will be unique in a join-back 
15 table frequently depends on whether the query refers to a 
table in a schema that has been "normalized". In general, a 
normalized schema stores different levels of information in 
different tables to avoid redundantly storing information. On 
the other hand, a de normalized schema contains information 
in fewer tables with more columns, where some of the 
columns store redundant information. 

For example, a denormalized schema may have a table 
that has a row for every city, where each row has a "county 
info" column. The rows for all cities in each given county 
would have the same information in their "county info" 
column. In a normalized schema, a separate "county 1 * table 
would be created that has one row per county. The "county 
info" would be stored in the county table. Because the 
county table has only one row per county, the "county info" 
for any given county would not be repeated. The difference 
between a normalized and a denormalized schema is illus- 
trated in FIGS. 8A, 8B and 8C. 

Referring to FIG. 8A, it illustrates a fact table 800. The 
fact table 800 includes records that correspond to individual 
product sales. For each sale, fact table 800 includes a unique 
identifier ("fact_key")> a time_key that indicates the time at 
which the sale was made, a product_key that indicates the 
product sold, and a "sales" value that indicates the price of 
the sale. Of these values, only the fact_key is unique for a 
given sale, since more than one sale can be made at any 
given time, and any given product type can be the subject of 
more than one sale. 

The database may also store information that is relevant 
to specific months and specific years. It would be inefficient 
to store this information in the fact table 800 itself, since the 
monthly information for a given month would have to be 
duplicated in the row for every sale made in that month. 
Similarly the yearly information for a given year would have 
to be duplicated in the row for every sale made in that year. 

To reduce this duplication, the month and year specific 
information could be broken out into a separate table. FIG. 
813 shows a Time table 802 that includes the month and year 
information. The time__key may indicate, for example, the 
particular day that a sale was made on. Time table 802 

includes one row for every time key value. Each row 

indicates the month and year associated with the time_Jcey, 
and the month information and year information for that 
month and year. 

By placing the month and year specific information in 
time table 802, the need to duplicate this information on a 
per-sale basis is avoided. However, time table 802 still 
duplicates this information on a per-day basis. For example, 
every time_key value associated with month January 1998 
will duplicate the monthly information for month January 
1998. Similarly, every time_key value associated with year 
1998 will duplicate the yearly information for year 1998. 
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Referring to FIG. 8C, it illustrates how the time informa- system inspects the functional dependency metadata to 
lion can be broken out into multiple tables: Time table 804, determine whether it. is possible to use the join back tech- 
Month table 806 and Year table 808. Time table 804 includes niques described above to rewrite the query to access the 
one entry for every time_key value, similar to Time table summary table. Specifically, if the group by column speci- 
802. However, Time table 804 does not attempt to store the 5 fied in the query is functionally dependent on a column in 
month and year specific information. Rather, the month- the summary table, the query can be rewritten to include a 
specific information is stored in Month table 806, that join back between the summary table and the table that 
includes one row for every month, and the year-specific contains the group by column specified in the query. For 
information is stored in Year table 808, that includes one row example, assume that the functional dependency informa- 
for every year. io tion indicates the existence of the following roll up path: 

The process of breaking out information into multiple time id-second-rainute-hour-day-month-year-century 

tables to avoid duplication, as shown in FIG. 8C, is known Assume that the summary table has sale values grouped 

as normalization. Schemas that do not break the information by minute, while a query requires the sales values to be 

out in this manner, as shown in FIG. 8B, are referred to as grouped by century. In this case, the database server reads 

"de-normalized" schemas. 15 the functional dependency metadata to determine that "cen- 

Io de-normalized schemas, the join-back key will often tury" is dependent on "minute", and rewrites the query to 

not be unique, and therefore a complex join back operation roll up the sales values in the summary table to produce sales 

will be required if the query is rewritten to access a summary values grouped by century. 

table. For example, assume that a database includes fact Io a normalized schema, a "time** table may have time_id 

table 800, time table 802, and a summary table ST that 20 and second columns, a "second" table may have second and 

includes the sum of sales by month. Queries that require the minute columns, a "minute" table may have minute and hour 

sum of sales by year that are rewritten to access the summary columns, an "hour" table may have hour and day columns, 

table will have to join back to Time table 802, where the join a "day" table may have day and month columns, a "month" 

is between the month column of the summary table and the table may have month and year columns, and a "year" table 

month column of Table 802. Since the month column of 25 may have year and century columns. Under these conditions, 

Time table 802 is not subject to a uniqueness constraint, a the summary table is joined back to the minute table and 

complex join back is required. eventually to the year table using multiple join conditions: 

In contrast, assume that a database includes fact table 800, ST.minute=minute.minute 

time table 804, month table 806, Year table 808 and a ^ minute. hour=hour. hour 

summary table ST that includes the sum of sales by month. nQur d a y=day day 

Queries that require the sum of sales by year that are ' u *u .u 

rewritten to access the summary table will join back to day.month=month.month 

Month table 806, where the join is between the month month.year=year.year 

column of Month table 806 and the month column of the Io lhis case > the minute toble IS the J om back toWe - 

summary table ST. Since the month column of Month table 35 Assuming that the minute column in the minute table is 

806 is subject to a uniqueness constraint, a simple join back sub 3 ect t0 a uniqueness constraint, only a simple join back 

may be performed. would be required. However, if the minute column of the 

minute table is not subject to a uniqueness constraint, then 

FUNCTIONAL DEPENDENCIES a complex joinback in the rewritten query would be 

When a first column is functionally dependent on a 40 re ^ uired ' ** shown below: 

second column, aggregate values that have been created by 

grouping based on the second column can be used to derive from ^ (select Di;rnNCr nunutcminute, year.century 

aggregate values that are grouped based on the first column. from minute, hour, day, month, year 

For example, the monthly sum of sales can be used to derive where minute.hour - houi.hour 

the yearly sum of sales. Stated another way, monthly sum of an <! hour ? a y = ^y.tay 

i i i_ <i 11 j ji * . i c i and month. year = year.year) v 

sales values can be "rolled up" to create yearly sum of sales y/bsn ST.minute - v.minute 



values. 

Roll up paths are sequences of columns arranged in a 
hierarchy that reflects the functional dependencies between 50 REWRITING OUTER JOINS WITH CHILD-SIDE 
the columns. For example, a "year" column may be func- DIMENSION TABLES 
tionally dependent on a "month" column, which is function- 
ally dependent on a "day" column, which is functionally In the preceding sections, query rewrite techniques have 
dependent on a "time_id" column. These functional depen- been described with respect to queries that include an outer 
dencies constitute the roll-up path of: time_id-day-month- 55 join between fact and dimension tables, where the fact table 
year. is on the child side of the outer join. Under these conditions, 

According to one embodiment, the database system stores the outer join preserves all fact rows, and therefore guaran- 

" functional dependency metadata" that indicates the func- tees that no rows in the fact table will be lost due to the join, 

tional dependencies between columns. The functional However, a user query may specify an outer join between 

dependency metadata may be generated by the database go fact and dimension tables, where the dimension table is on 

system automatically based on primary key constraint the child side of the outerjoin. These types of queries are 

information, or explicitly declared by a user. The functional typical in data warehouse environments. The result of an 

dependency metadata reflect functional dependencies outer join in which the child-side table is a dimension table 

between columns that belong to the same table, as well as and the parent-side table is a fact table preserves all of the 

between columns of different tables. 65 rows in the dimension table. For example, a user may 

When a query requires grouping to be performed based on request the sum-of-sales by product. It is highly desirable to 

values that are not stored in a summary table, the database list all products in such a list, including those products for 
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which no sales occurred. To accomplish this, a query can be 
constructed in which the product table is on the child-side of 
an outer join, and the sales table is on the parent-side of the 
outer join. 

Just as with the other types of queries discussed above, the 
most efficient way to handle these types of queries may be 
to retrieve data from materialized views that are not refer- 
enced in the original queries. The following is an illustration 
of a technique for rewriting such queries to access a mate- 
rialized view that is not referenced by the original query. 
Assume that the materialized view is a summary table that 
is defined as follows: 



create summary sales__summary (prod_id, sum__sales) as 
select products. prod_id, sum(fact^ales) from fact, products 
where fact.prod_id = products.prod_id (+) 
group by factprod__id; 



Assume that the database server receives the following 
query: 

select products.prod_jd sum(fact^ales) from fact, prod- 
uct 

where (+) fact.prod_id=products.prod_id 
group by products.prod__Id 

In this example, the outer join in the summary is different 
from the outer join in the query. Specifically, the summary 
contained the join "fact->products", while the query speci- 
fied the join w products->fact w . Because some rows from the 
products table may have been lost in the "fact->products" 
join of the summary, the summary does not necessarily have 
all of the values from the products. prod__id column. 
However, the query requires sales to be grouped based on 

values from the products.prod id column. Consequently, to 

use the summary table to process the query, the summary 
table is joined back to the products table. 

Upon receiving the query, the database server performs 
the relevant eligibility tests (described above) on the sum- 
mary table to determine whether the summary table is 
eligible to be used to rewrite the query. In the present 
example, the summary satisfies the eligibility tests, and the 
query can be rewritten as follows: 

select products.prod id, sal es__.su mm a ry. sum sales 

from sales__summary, products 

where (+) sales_summary.prod_jd=products.prod_id; 

The outer join (products->summary) in the rewritten 
query accomplishes two things: 1) it eliminates all non- 
matching fact rows that were preserved by the summary, and 
2) it retains all products rows with NULL padding of 
sum-of-sales for all non-matching product rows. 

Significantly, the same rewrite could have been performed 
if the definition of the sales_summary had included an inner 
join instead of an outer join. However, the outer join in the 
rewritten query would not eliminate any non-matching fact 
rows, since the summary would not have any such rows. 

In the foregoing specification, the invention has been 
described with reference to specific embodiments thereof. It 
will, however, be evident that various modifications and 
changes may be made thereto without departing from the 
broader spirit and scope of the invention. The specification 
and drawings are, accordingly, to be regarded in an illus- 
trative rather than a restrictive sense. 

What is claimed is: 

1. A method for processing queries, the method compris- 
ing the steps of: 

receiving a query that does not reference a particular 
materialized view; 
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said query specifying a first set of one or more aggregate 
functions; 

said particular materialized view reflecting a second set of 
one or more aggregate functions; 

determining whether the particular materialized view sat- 
isfies each condition in a set of conditions, the set of 
conditions at least including: 

that each aggregate function in said first set of aggre- 
gate functions be computable from one or more 
corresponding aggregate functions in said second set 
of aggregate functions; and 

that the argument to each aggregate function in said 
first set of aggregate functions be equivalent to the 
argument of the one or more corresponding aggre- 
gate functions in said second set of aggregate func- 
tions; 

wherein said first set of aggregate functions includes a 
particular aggregation function to be applied to a target 
population, wherein said particular aggregation func- 
tion is an aggregation function from a set of aggrega- 
tion functions that consists of variance and standard 
deviation; 

testing whether each aggregate function in said first set of 
aggregate functions is computable from one or more 
corresponding aggregate functions in said second set of 
aggregate functions, wherein said testing includes 
determining whether the particular materialized view 
includes a variance, sum, and count of a source popu- 
lation on which the target population is functionally 
dependent; and 

if said materialized view satisfies each condition in said 
set of conditions, then rewriting said query to produce 
a rewritten query that references said materialized view 
and derives said results of said particular aggregation 
function for said target population from the variance, 
sum, and count of said source population. 

2. The method of claim 1 wherein said particular aggre- 
gate function is variance. 

3. The method of claim 1 wherein said particular aggre- 
gate function is standard deviation. 

4. A method for processing queries, the method compris- 
ing the steps of: 

receiving a query that does not reference a particular 

materialized view; 
said query specifying a first set of one or more aggregate 

functions; 

said particular materialized view reflecting a second set of 
one or more aggregate functions; 

detennining whether the particular materialized view sat- 
isfies each condition in a set of conditions, the set of 
conditions at least including: 

that each aggregate function in said first set of aggre- 
gate functions be computable from one or more 
corresponding aggregate functions in said second set 
of aggregate functions; and 

that the argument to each aggregate function in said 
first set of aggregate functions be equivalent to the 
argument of the one or more corresponding aggre- 
gate functions in said second set of aggregate func- 
tions; and 

testing whether the argument of each aggregate function 
in said first set of aggregate functions is equivalent to 
the argument of one or more corresponding aggregate 
functions in said second set of aggregate functions by 
a) creating a transformed version of the argument by 

transforming the argument of the aggregate function 

to a canonical form; 
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b) creating transformed versions of the arguments of 
the one or more corresponding aggregate functions 
by transforming the arguments of the one or more 
corresponding aggregate functions to a canonical 
form; and 5 

c) comparing the transformed version of the argument 
to the transformed versions of the arguments of the 
one or more corresponding aggregate functions; and 

if said materialized view satisfies each condition in said 
set of conditions, then rewriting said query to produce 10 
a rewritten query that references said materialized 
view. 

5. The method of claim 4 wherein the step of comparing 
the transformed version of the argument to the transformed 
versions of the arguments of the one or more corresponding 
aggregate functions includes performing a byte-to-byte 15 
comparison of the transformed version of the argument to 
the transformed versions of the arguments of the one or more 
corresponding aggregate functions. 

6. The method of claim 4 wherein the step of transforming 
the argument of the aggregate function to a canonical form 20 
includes sorting operands contained within the argument 
that are combined using a commutative operator. 

7. The method of claim 4 wherein the step of creating 
transformed versions of the arguments of the one or more 
corresponding aggregate functions is performed prior to 25 
receiving said query. 

8. The method of claim 7 further comprising the step of 
persistently storing, as metadata associated with said mate- 
rialized view, the transformed versions of the arguments of 
the one or more corresponding aggregate functions. 30 

9. The method of claim 7 wherein the step of creating 
transformed versions of the arguments of the one or more 
corresponding aggregate functions is performed in response 
to creation of said materialized view. 

10. A method for rewriting queries, the method compris- 35 
ing the steps of: 

receiving a query that specifies an outer join that has a 
dimension table on the child-side of the outer join and 
a fact table on the parent-side of the outer join; 

rewriting the query to produce a rewritten query by 
replacing references to the fact table in the query with 
references to a materialized view; 

wherein the materialized view includes a join between the 
dimension table and the fact table; and 45 

wherein the rewritten query specifies an outer join that has 
the dimension table on the child side and the materi- 
alized view on the parent side. 

11. The method of claim 10 wherein the join included in 
the materialized view is an outer join that has the dimension 50 
table on the parent-side and the fact table on the child-side. 

12. The method of claim 10 wherein the join included in 
the materialized view is an inner join that has the dimension 
table on the parent-side and the fact table on the child-side. 



13. The method of claim 10 wherein: 
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the materialized view is a summary table that contains 
values grouped by a particular column in the fact table; 

the query specifies an aggregation that requires values 
grouped by a particular column in the dimension table; 
and 60 

the rewritten query contains an outer join, where the 
particular column of the dimension table the child-side 
join key and a column of the summary table that 
contains values from the particular column of the fact 
table. 65 

14. A computer-readable medium carrying one or more 
sequences of one or more instructions for processing 



queries, the one or more sequences of one or more instruc- 
tions including instructions which, when executed by one or 
more processors, cause the one or more processors to 
perform the steps of: 

receiving a query that does not reference a particular 

materialized view; 
said query specifying a first set of one or more aggregate 
functions; 

said particular materialized view reflecting a second set of 
one or more aggregate functions; 

determining whether the particular materialized view sat- 
isfies each condition in a set of conditions, the set of 
conditions at least including: 

that each aggregate function in said first set of aggre- 
gate functions be computable from one or more 
corresponding aggregate functions in said second set 
of aggregate functions; and 

that the argument to each aggregate function in said 
first set of aggregate functions be equivalent to the 
argument of the one or more corresponding aggre- 
gate functions in said second set of aggregate func- 
tions; 

wherein said first set of aggregate functions includes a 
particular aggregation function to be applied to a target 
population, wherein said particular aggregation func- 
tion is an aggregation function from a set of aggrega- 
tion functions that consists of variance and standard 
deviation; 

testing whether each aggregate function in said first set of 
aggregate functions is computable from one or more 
corresponding aggregate functions in said second set of 
aggregate functions, wherein said testing includes 
determining whether the particular materialized view 
includes a variance, sum, and count of a source popu- 
lation on which the target population is functionally 
dependent; and 

if said materialized view satisfies each condition in said 
set of conditions, then rewriting said query to produce 
a rewritten query that references said materialized view 
and derives said results of said particular aggregation 
function for said target population from the variance, 
sum, and count of said source population. 

15. The computer-readable medium of claim 14 wherein 
said particular aggregate function is variance. 

16. The computer-readable medium of claim 14 wherein 
said particular aggregate function is standard deviation. 

17. A computer-readable medium carrying one or more 
sequences of one or more instructions for processing 
queries, the one or more sequences of one or more instruc- 
tions including instructions which, when executed by one or 
more processors, cause the one or more processors to 
perform the steps of: 

receiving a query that does not reference a particular 

materialized view; 
said query specifying a first set of one or more aggregate 

functions; 

said particular materialized view reflecting a second set of 
one or more aggregate functions; 

determining whether the particular materialized view sat- 
isfies each condition in a set of conditions, the set of 
conditions at least including: 

that each aggregate function in said first set of aggre- 
gate functions be computable from one or more 
corresponding aggregate functions in said second set 
of aggregate functions; and 



08/12/2002, EAST Version: 1.03.0002 



5,991,754 



43 



44 



that the argument to each aggregate function in said 
first set of aggregate functions be equivalent to the 
argument of the one or more corresponding aggre- 
gate functions in said second set of aggregate func- 
tions; and 

testing whether the argument of each aggregate function 
in said first set of aggregate functions is equivalent to 
the argument of one or more corresponding aggregate 
functions in said second set of aggregate functions by 

a) creating a transformed version of the argument by 
transforming the argument of the aggregate function 
to a canonical form; 

b) creating transformed versions of the arguments of 
the one or more corresponding aggregate functions 
by transforming the arguments of the one or more 
corresponding aggregate functions to a canonical 
form; and 

c) comparing the transformed version of the argument 
to the transformed versions of the arguments of the 
one or more corresponding aggregate functions; 

if said materialized view satisfies each condition in said 
set of conditions, then rewriting said query to produce 
a rewritten query that references said materialized 
view. 

18. The computer-readable medium of claim 17 wherein 
the step of comparing the transformed version of the argu- 
ment to the transformed versions of the arguments of the one 
or more corresponding aggregate functions includes per- 
forming a byte-to-byte comparison of the transformed ver- 
sion of the argument to the transformed versions of the 
arguments of the one or more corresponding aggregate 
functions. 

19. The computer-readable medium of claim 17 wherein 
the step of transforming the argument of the aggregate 
function to a canonical form includes sorting operands 
contained within the argument that are combined using a 
commutative operator. 

20. The computer-readable medium of claim 17 wherein 
the step of creating transformed versions of the arguments of 
the one or more corresponding aggregate functions is per- 
formed prior to receiving said query. 

21. The computer-readable medium of claim 20 further 
comprising sequences of instructions for performing the step 
of persistently storing, as metadata associated with said 
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materialized view, the transformed versions of the argu- 
ments of the one or more corresponding aggregate functions. 

22. The computer-readable medium of claim 20 wherein 
the step of creating transformed versions of the arguments of 
the one or more corresponding aggregate functions is per- 
formed in response to creation of said materialized view. 

23. A computer- readable medium carrying one or more 
sequences of one or more instructions for rewriting queries, 
the one or more sequences of one or more instructions 
including instructions which, when executed by one or more 
processors, cause the one or more processors to perform the 
steps of: 

receiving a query that specifies an outer join that has a 
dimension table on the child-side of the outer join and 
a fact table on the parent-side of the outer join; 

rewriting the query to produce a rewritten query by 
replacing references to the fact table in the query with 
references to a materialized view; 

wherein the materialized view includes a join between the 
dimension table and the fact table; and 

wherein the rewritten query specifies an outer join that has 
the dimension table on the child side and the materi- 
alized view on the parent side. 

24. The computer-readable medium of claim 23 wherein 
the join included in the materialized view is an outer join 
that has the dimension table on the parent-side and the fact 
table on the child-side. 

25. The computer-readable medium of claim 23 wherein 
the join included in the materialized view is an inner join 
that has the dimension table on the parent-side and the fact 
table on the child-side. 

26. The computer-readable medium of claim 23 wherein: 
the materialized view is a summary table that contains 

values grouped by a particular column in the fact table; 
the query specifies an aggregation that requires values 
grouped by a particular column in the dimension table; 
and 

the rewritten query contains an outer join, where the 
particular column of the dimension table the child-side 
join key and a column of the summary table that 
contains values from the particular column of the fact 
table. 
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